1. Introduction: A Story About Putting Data to Use


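Lin Biao locates the enemy headquarters

In this story the commander did several related things, which together we call the "data closed loop":

• Collect battlefield data without interruption.
• Use a metric to abstract the actual state of the battlefield.
• Analyze and model that metric to spot an opportunity.
• Make the battlefield decision on the basis of the analysis and secure the greatest gain.

Replace the commander with a CEO and you have the business data closed loop. Practically every business application of data depends on this loop.

What kind of model do you think the commander built?

• Measuring business actions: for example, using a metric to measure the user response rate.
• Identifying business opportunities: planning new products or campaigns for Singles' Day, understanding fluctuations in customer data, evaluating marketing results.
• Turning data into knowledge.

The four levels of data use

• Data: the bottom layer, a vast ocean of raw data. Form: database. Function: direct data retrieval.
• Information: metrics distilled from the data. How many new customers are there? How many returning customers? What characterizes each group? Form: summary reports. Function: supplying metrics that answer the question of what has already happened; business staff who use them well can solve many problems with them.
• Knowledge: models of the relationships between metrics, built on the information. Under what conditions does a new customer become a returning customer? Form: model. Function: answering the question of why, explaining relationships and causes, predicting the future.
• Insight: knowledge built into a product. What must we do to make new customers convert? Form: data product. Function: controlling the future.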

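• A data product is a vehicle that gives decision-makers actionable information, for example:
  o Amazon's product recommendations
  o Weather forecasts
  o Stock market predictions
  o Production process improvements
  o Health diagnosis
  o Flu trend predictions

Some things also appear to give decision-makers actionable information, such as almanacs and astrology, but they are not insights drawn from data.

Data mining is also called knowledge discovery. It is a process of discarding the dross to keep the essence and discarding the false to keep the true: the processes and methods for extracting and summarizing useful knowledge from large volumes of data. Applied to decision-making, it can improve human welfare.

Kepler's three laws: Kepler's teacher Tycho Brahe collected a vast amount of astronomical observations, but it was Kepler who found the laws behind the data by studying it.

Some related concepts

  o Machine learning
  o Statistical theory
  o Data science
  o Pattern recognition

                              Algorithm development needed   No algorithm development needed
Data development needed       Data scientist                 Data mining engineer
No data development needed    Algorithm engineer             Kaggle player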


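3. What Data Mining Has to Do with Us

Why data mining is needed:

• Without data, what can a decision rest on? (Intuition, induction from experience, logical reasoning, fortune-telling.)
  o We need data because data is the historical trace left by the real world, and we have to draw our inferences from those traces.
  o There is too much data for the human brain to process directly. More and more data is recorded, and its forms and sources grow ever more complex: naturally generated data and data generated by human society (social networks, text, images, and more).
  o So we need the help of mining tools and methods.

Why is data mining so popular?

• The preconditions are now all in place.
  o Falling hardware prices have lowered the cost of storing and computing on data.
  o Individuals and startups are able to enter the data field.
  o Open-source software tools and shared open courses make crossing over from other fields easier.
  o The walls between disciplines have come down; it is fairly easy to acquire and learn the knowledge and tools of another discipline and become a "professional amateur".

Four directions in data work

• Engineering: covers building systems, developing data reports, and developing data products, for example the development of Taobao's Data Cube (数据魔方). This kind of work requires a strong software development background.
• Modeling: focuses on studying the data, using statistical theory or machine learning methods to analyze and model it, for example modeling ad click-through rates. This kind of work requires a solid command of statistical theory and modeling algorithms.
• Visualization: focuses on creating with data, using web technology to visualize data or produce infographics, for example the Guardian's data site. It requires strong visualization skills and front-end technique.
• Value: focuses on the business value contained in the data and stresses domain understanding and communication, for example the business analysis reports published by consulting firms. It requires broad business knowledge and commercial acumen.

Which application areas are there:

• Text and speech recognition
• Playing board games ...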


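4. What Is a Model?

• A simple example of a model: how can you tell whether a watermelon is sweet without cutting it open?
• A possible approach:
  o Prepare N watermelons.
  o Design M variables or metrics (feature engineering) and record them before cutting each melon open. Call these X.
  o After cutting, have n people taste each melon and score it (0 means not sweet, 1 means sweet). Call this Y.
  o Using one part of the data, model the relationship between X and Y with a classification algorithm.
  o Use the remaining data to check how well the model works.
  o Write the model's logic into an app, publish it to an app store, keep collecting data, and keep improving the model.
• A harder example: how would you judge whether an elderly person will develop dementia in the future? (Think it over.)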
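The watermelon procedure can be sketched in a few lines. This is only an illustration under stated assumptions: the measurements and the rule that generates sweetness are invented (no real melon data exists in these notes), and logistic regression stands in for "a classification algorithm".

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)

# Hypothetical data: N melons, M = 3 made-up measurements taken before cutting (X).
N = 400
X = rng.normal(size=(N, 3))
# Made-up ground truth (Y): sweetness depends on the first two measurements plus noise.
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=N) > 0).astype(int)

# Use one part of the data to fit the model and hold the rest back.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = LogisticRegression()
model.fit(X_train, y_train)

# Check the model on melons it has never seen.
accuracy = model.score(X_test, y_test)
print("held-out accuracy: %.2f" % accuracy)
```

The held-out accuracy is the number to watch: a model that only does well on the melons it was trained on has learned nothing useful.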


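Examples of models used in business:

• Credit card issuance at a bank:

When a bank issues a credit card it performs a review to decide whether the applicant qualifies for credit. Traditionally this was done by hand: the applicant fills in a form with age, occupation, income and similar details (the X variables) and hands it to an experienced risk officer, who evaluates it and decides whether or not to issue the card.

In the internet era this kind of manual review is too slow, and the rise of internet finance is the clearest sign of the trend. It collects and integrates behavioral data about individuals from across the web. Some of those individuals will already have borrowed from financial institutions; whether they defaulted becomes Y, and the rest of the behavioral data becomes X. Together these form a dataset that can be fed to a classification algorithm. The resulting model can then be applied to future applicants, giving an automated review system.

• Automatic spam classification in Gmail:

If you open your Gmail inbox and look closely you will find a spam label. It quietly blocks a large volume of junk for you without getting in your way, which makes it a model data product. So how would you go about building this kind of automatic spam classifier yourself?

If we keep things simple and set aside the other information in an email (sender, IP, ...) to consider only its text, the problem becomes one of text classification.

The difficulty in analyzing text is that text is not written for computers to read. It has complex linguistic structure (syntax, semantics), yet language still exhibits statistical regularities (statistical language models).

A simple text classification model: deciding whether an email is spam

• Collect N emails.
• Extract metrics from the emails (word segmentation, vector space model) to build a document-term matrix.
• Label by hand whether each email is spam.
• Using one part of the data, model the relationship between X and Y with an algorithm.
• Use the remaining data to check how well the model works.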
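The five steps above can be sketched with scikit-learn. The eight "emails" and their labels are invented for illustration, and naive Bayes is an assumed choice of algorithm; the notes do not name one.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up corpus; 1 = spam, 0 = not spam (hand labels).
emails = [
    "win cash prize now", "cheap pills free offer", "free cash win win",
    "claim your free prize", "meeting agenda for monday",
    "lunch with the project team", "please review the monday report",
    "project report attached",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Hold out one spam and one non-spam message for checking.
train_idx = [0, 1, 2, 4, 5, 6]
test_idx = [3, 7]

# Build the document-term matrix (word counts) from the training emails only.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform([emails[i] for i in train_idx])
X_test = vectorizer.transform([emails[i] for i in test_idx])
y_train = [labels[i] for i in train_idx]
y_test = [labels[i] for i in test_idx]

model = MultinomialNB()
model.fit(X_train, y_train)

predictions = model.predict(X_test).tolist()
print(predictions, y_test)
```

Words the vectorizer never saw during training ("claim", "attached") are simply ignored at prediction time, which is one reason a real spam filter needs a far larger corpus than this.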


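5. What Day-to-Day Data Mining Work Involves

• Project discussion and planning, in other words meetings. The main aim here is to pin down the business problem. Can it be done? Roughly how? Once the business problem is settled, it has to be translated into a data problem.

• Project preparation: getting ready to start work. This is the most tedious part and the easiest to get wrong. You work with the data warehouse team to obtain the necessary data, explore and understand its business meaning, assess its quality, clean and transform it as the project requires, and do a great deal of feature engineering.

• Project execution, i.e. data modeling: time to interrogate the data. Choose and try different models and algorithms, get the results you need from the data, then assess from several angles how good they are.

• Project close and delivery. Decide how the model will be deployed and carry out the deployment. Deployment is the model in use; in most cases it means writing results back to a database. At the same time the results are handed over to whoever requested them, the final project report is written, and all files are archived.

• Reading the literature and researching methods. In quieter periods, or when you hit a hard problem, you need to go and stand on the shoulders of giants for a while.

A typical sequence of steps

• Business understanding: understand the business goals and requirements and turn them into a problem definition that data mining can work with. The modeler sits in on the business team's meetings, mainly to learn and collect business requirements.
• Data understanding: select the target data, check data quality, explore the data's characteristics, and assess what data is usable. The modeler presents some preliminary results to the business team and gets further feedback.
• Data preparation: assemble the dataset through cleaning, integration, and transformation.
• Model building: select and apply various machine learning or statistical methods, build models, and tune their parameters. The modeler enters the "alchemy" stage, hoping for good results.
• Model evaluation: evaluate and interpret the model against the original business goals and estimate its likely business impact.
• Model deployment: implement and release the model in the way users are accustomed to, provide the analytical conclusions, and keep tracking it. The modeler puts the model into production and monitors its performance.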


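6. Data Mining Task Types

• Classification [Predictive]
• Clustering [Descriptive]
• Association Rule Discovery [Descriptive]
• Sequential Pattern Discovery [Descriptive]
• Regression [Predictive]
• Deviation Detection [Predictive]

Applications of classification

Precision marketing

• Problem: a new iPhone is about to go on sale. Which users are likely to buy it?
• Method:
  o Find user behavior data for similar products and use it as the target variable.
  o Collect those users' behavioral features as the model's explanatory variables.

Applications of clustering

• Problem: how to divide users into groups.
• Method:
  o Collect users' basic demographic and behavioral features.
  o Cluster users into groups according to how similar they are.
  o Judge the quality of the segmentation by the purchasing behavior of users in the same group.

Applications of association rules

Recommending similar products

• An association rule is a regularity of the form {bread, ...} --> {...}.
• Users who buy one set of items often go on to buy a particular other item. Such a rule can be used to place the items together on the shelf and sell them as a bundle.
• Different associated products point to different consumption scenarios.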
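How strong a rule such as {bread, milk} --> {butter} is can be measured with two standard quantities, support and confidence. The sketch below computes both in plain Python; the baskets and item names are invented, and the helper names `support` and `confidence` are illustrative, not part of any library.

```python
# Made-up shopping baskets; item names are illustrative only.
baskets = [
    {"bread", "milk", "butter"},
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "milk", "butter", "eggs"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    hits = sum(1 for basket in baskets if itemset <= basket)
    return hits / float(len(baskets))

def confidence(antecedent, consequent, baskets):
    """Estimated P(consequent in basket | antecedent in basket)."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)

# Rule under test: {bread, milk} --> {butter}
rule_support = support({"bread", "milk", "butter"}, baskets)
rule_confidence = confidence({"bread", "milk"}, {"butter"}, baskets)
print("support=%.2f confidence=%.2f" % (rule_support, rule_confidence))
```

Here two of the five baskets contain all three items (support 0.40), and two of the three baskets with bread and milk also contain butter (confidence about 0.67).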


Applications of anomaly detection

• Finding abnormal patterns of behavior among normal behavior
• Credit card fraud


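7. Which Skills to Master

• Tangible skills:
  o Theory: the "Qi school". For example statistical theory and machine learning algorithms. In my experience, once you have mastered the theory, data work goes as easily as hot soup melting snow. I admit that learning theory is hard, but you must read the hardest books while you are young. "The Beauty of Mathematics" (《数学之美》) says that technique divides into shu and dao: the concrete way of doing a thing is shu, and the principles behind doing it are dao. People who pursue only shu work very hard; only those who grasp dao can always work with ease.
  o Tools: the "Sword school". Theory that never reaches a product is academic in the manner of Wang Yuyan. Getting from theory to product takes command of various tools. Once you are fluent in them they let you achieve twice as much with half the effort; examples are R, Python, SQL, and Hadoop. Learning a tool is like learning a language: read a lot, write a lot, and think it through, and you will be able to use it freely. Do not put blind faith in tools, though. There is no best tool, only the most suitable one; a true master can fight with an embroidery needle.
  o Experience: real combat. Once you have inner strength and swordsmanship, it is time to come down from the mountain. Only fighting the toughest opponents fuses the two into one. Doing projects and solving hard problems at work is the fastest way to improve.

• Intangible qualities:
  o Curiosity. Curiosity and interest are what drive you to draw insight from data. Only curious people keep up their enthusiasm for data.
  o Creativity. In war there are no constant conditions, and no two data jobs are alike. You can get by copying from past experience, but it is better to make judgments based on the circumstances of each project. Independent thinking and creativity take you further.
  o Seeking defeat. Creative, frontier, exploratory work is bound to include failures. Fail fast, learn fast, keep correcting, and you can win out of defeat.

How to develop data mining skills

• Self-study. "Knowledge and patience are the only way to defeat the strong."
  o Learn by reading: classic theory textbooks, code, papers, open courses.
  o Learn from experts: meetups and discussions with peers, and experts' blogs, Weibo, Twitter, and RSS feeds.
  o Learn by practicing: coding exercises, Kaggle competitions, and real problems at work.
  o Learn by sharing: writing your own notes, blogging, exchanging ideas with colleagues, training newcomers.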


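8. Lessons from Experience

Pitfalls in modeling

• Problems in the modeling process
  o Lack of communication about, and understanding of, the business problem
  o Looking only at the training data, or trusting the data too much
  o Relying on a single technique
  o Feeding in the wrong variables
• Problems with the model results
  o The model's results do not represent any real pattern
  o The training set may not reflect the true population
  o The data is at the wrong level of detail
• The mining results are useless
  o The results are common knowledge
  o The results cannot be used for decisions

Some takeaways

• Asking the question matters more than answering it. A concrete business pain point is where data mining starts. Plan the steps of the process carefully; business knowledge runs through every stage of mining and modeling.
• Treat data with caution. Data may well be wrong, and tidying it takes up most of the working time. Data by itself can only describe history; it cannot reveal causes and cannot foresee the future.
• The value of data shows in deployed applications. The value of data mining does not depend on how accurate or stable the model is, but on how the results are put to use.
• Every metric and model has its range of applicability. All patterns change as time and circumstances change, so keep trying and keep correcting.
• Document and automate your work.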


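Modeling steps for an end-to-end case

• Problem definition
• Data exploration
• Feature engineering
• Modeling and evaluation

Problem definition

We use a classic Kaggle problem as our worked example: predicting whether a borrower will become seriously delinquent within the following two years.

Key questions to think about

• How do losses arise?
• What characterizes the people who default?
• What proportion of borrowers default?
• How can we reduce our losses?

Data source

http://www.kaggle.com/c/GiveMeSomeCredit

Meaning of the variables

• SeriousDlqin2yrs
  o The borrower was 90 days or more past due on a repayment in the following two years. This is the target variable; all the others are explanatory variables.
• RevolvingUtilizationOfUnsecuredLines
  o Share of credit-card borrowing
• age
  o Age of the borrower
• NumberOfTime30-59DaysPastDueNotWorse
  o Number of times the borrower was 30-59 days past due in the past
• DebtRatio
  o Monthly living costs as a ratio of monthly income
• MonthlyIncome
  o Monthly income
• NumberOfOpenCreditLinesAndLoans
  o Number of open loans and lines of credit
• NumberOfTimes90DaysLate
  o Number of times the borrower was 90 days or more past due in the past
• NumberRealEstateLoansOrLines
  o Number of mortgage and real-estate loans
• NumberOfTime60-89DaysPastDueNotWorse
  o Number of times the borrower was 60-89 days past due in the past
• NumberOfDependents
  o Number of dependents in the household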


import pandas as pd 

import numpy as np 

import seaborn as sns 

import matplotlib.pyplot as plt 
%matplotlib inline 


df = pd.read_csv("data/credit-training.csv")
df.head()

   SeriousDlqin2yrs  RevolvingUtilizationOfUnsecuredLines  age
0                 1                              0.766127   45
1                 0                              0.957151   40
2                 0                              0.658180   38
3                 0                              0.233810   30
4                 0                              0.907239   49

df.shape

(150000, 11)

Data exploration

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 11 columns):
SeriousDlqin2yrs                        150000 non-null int64
RevolvingUtilizationOfUnsecuredLines    150000 non-null float64
age                                     150000 non-null int64
NumberOfTime30-59DaysPastDueNotWorse    150000 non-null int64
DebtRatio                               150000 non-null float64
MonthlyIncome                           120269 non-null float64
NumberOfOpenCreditLinesAndLoans         150000 non-null int64
NumberOfTimes90DaysLate                 150000 non-null int64
NumberRealEstateLoansOrLines            150000 non-null int64
NumberOfTime60-89DaysPastDueNotWorse    150000 non-null int64
NumberOfDependents                      146076 non-null float64
dtypes: float64(4), int64(7)
memory usage: 12.6 MB

df.describe()

       SeriousDlqin2yrs  RevolvingUtilizationOfUnsecuredLines
count     150000.000000                         150000.000000
mean           0.066840                              6.048438
std            0.249746                            249.755371
min            0.000000                              0.000000
25%            0.000000                              0.029867
50%            0.000000                              0.154181
75%            0.000000                              0.559046
max            1.000000                          50708.000000

df.SeriousDlqin2yrs.value_counts()

0    139974
1     10026
Name: SeriousDlqin2yrs, dtype: int64


df.SeriousDlqin2yrs.mean()

0.06684

df.NumberOfDependents.unique()

array([  2.,   1.,   0.,  nan, ...,  10.,   9.,  13.])

df.NumberOfDependents.value_counts()

0.0     86902
1.0     26316
2.0     19522
3.0      9483
4.0      2862
5.0       746
6.0       158
7.0        51
8.0        24
10.0        5
9.0         5
20.0        1
13.0        1
Name: NumberOfDependents, dtype: int64


df.groupby("SeriousDlqin2yrs").mean() 


RevolvingUtilizationOfUnsecuredLines age 

SeriousDlqin2yrs 
0 6.168855 52.751375 
1 4.367282 45.926591 


pd.value_counts(df.NumberOfDependents).plot(kind='bar'); 


[Figure: bar chart of NumberOfDependents value counts]

pd.crosstab(df.NumberOfTimes90DaysLate, df.SeriousDlqin2yrs)
# compute the cross-tabulated frequency table

[Output: frequency table with NumberOfTimes90DaysLate as rows and SeriousDlqin2yrs as columns]

pd.crosstab(df.age, df.NumberOfDependents)

[Output: frequency table with age as rows and NumberOfDependents as columns, 84 rows × 13 columns]

Data cleaning


import re

# rename every column to snake_case
def camel_to_snake(column_name):
    """
    converts a string that is camelCase into snake_case
    """
    s1 = re.sub('(.)([A-Z][a-z]*)', r'\1_\2', column_name)
    return re.sub('([a-z0-9])([A-Z])', r'\1_\2', s1).lower()

camel_to_snake("javaLovesCamelCase")

'java_loves_camel_case'

df.columns = [camel_to_snake(col) for col in df.columns]
df.columns.tolist()

['serious_dlqin2yrs',
 'revolving_utilization_of_unsecured_lines',
 'age',
 'number_of_time30-59_days_past_due_not_worse',
 'debt_ratio',
 'monthly_income',
 'number_of_open_credit_lines_and_loans',
 'number_of_times90_days_late',
 'number_real_estate_loans_or_lines',
 'number_of_time60-89_days_past_due_not_worse',
 'number_of_dependents']


from sklearn.neighbors import KNeighborsRegressor
income_imputer = KNeighborsRegressor(n_neighbors=1)

# Split the data into rows with and without a missing monthly income, then
# build a model on the complete rows to estimate the missing values.
train_w_monthly_income = df[df.monthly_income.isnull() == False]
train_w_null_monthly_income = df[df.monthly_income.isnull() == True].copy()

cols = ['number_real_estate_loans_or_lines', 'number_of_open_credit_lines_and_loans']

# fit the model on the rows where monthly income is known
income_imputer.fit(train_w_monthly_income[cols], train_w_monthly_income.monthly_income)

KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
          metric_params=None, n_jobs=1, n_neighbors=1, p=2,
          weights='uniform')

new_values = income_imputer.predict(train_w_null_monthly_income[cols])
# then use the model to predict monthly income for the rows where it is missing

train_w_null_monthly_income.loc[:, 'monthly_income'] = new_values
# imputation

# pd.concat is used here because newer pandas versions no longer provide DataFrame.append
df_imputed = pd.concat([train_w_monthly_income, train_w_null_monthly_income])
df_imputed.shape  # the missing values are now filled in

(150000, 11)


df_imputed.loc[df_imputed.number_of_dependents.isnull(), 'number_of_dependents'] = -1

df_imputed.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 150000 entries, 0 to 149997
Data columns (total 11 columns):
serious_dlqin2yrs                              150000 non-null int64
revolving_utilization_of_unsecured_lines       150000 non-null float64
age                                            150000 non-null int64
number_of_time30-59_days_past_due_not_worse    150000 non-null int64
debt_ratio                                     150000 non-null float64
monthly_income                                 150000 non-null float64
number_of_open_credit_lines_and_loans          150000 non-null int64
number_of_times90_days_late                    150000 non-null int64
number_real_estate_loans_or_lines              150000 non-null int64
number_of_time60-89_days_past_due_not_worse    150000 non-null int64
number_of_dependents                           150000 non-null float64
dtypes: float64(4), int64(7)
memory usage: 13.7 MB

Feature engineering


df_imputed.monthly_income.hist(); 


def cap_values(x, cap):
    if x > cap:
        return cap
    else:
        return x

# cap extreme values
df_imputed.monthly_income = df_imputed.monthly_income.apply(lambda x: cap_values(x, 15000))

# discretize the variable into 15 bins
df_imputed['income_bins'] = pd.cut(df_imputed.monthly_income, bins=15, labels=False)
pd.value_counts(df_imputed.income_bins)

3     23168
4     19944
5     15583
6     14475
2     14038
7     10766
8      8609
14     7775
9      7672
1      7504
10     6298
0      4994
11     4454
12     2547
13     2173
Name: income_bins, dtype: int64

df_imputed[["income_bins", "serious_dlqin2yrs"]].groupby("income_bins").mean()
# mean default rate within each monthly-income bin

             serious_dlqin2yrs
income_bins
0                     0.051862
1                     0.104211
2                     0.093674
3                     0.084168
4                     0.073305
5                     0.066033
6                     0.059067
7                     0.054338
8                     0.050877
9                     0.048488
10                    0.039695
11                    0.037270
12                    0.040832
13                    0.042338
14                    0.047203

# plotting this shows that defaults are clearly highest in bins 1-2
cols = ["income_bins", "serious_dlqin2yrs"]
df_imputed[cols].groupby("income_bins").mean().plot();

[Figure: line plot of the mean serious_dlqin2yrs by income bin]

# by age, the rate is highest at 20-30; it also spikes around 100, perhaps because the borrower has died?
cols = ['age', 'serious_dlqin2yrs']
age_means = df_imputed[cols].groupby("age").mean()
age_means.plot();

[Figure: line plot of the mean serious_dlqin2yrs by age]





mybins = [0] + list(range(20, 80, 5)) + [120]
df_imputed['age_bin'] = pd.cut(df_imputed.age, bins=mybins)
pd.value_counts(df_imputed['age_bin'])

(45, 50]     18829
(50, 55]     17861
(55, 60]     16945
(60, 65]     16461
(40, 45]     16208
(35, 40]     13611
(65, 70]     10963
(30, 35]     10728
(75, 120]    10129
(25, 30]      7730
(70, 75]      7507
(20, 25]      3027
(0, 20]          0
dtype: int64

from sklearn.preprocessing import StandardScaler

# StandardScaler expects a 2-D array, so reshape the column and flatten the result
df_imputed['monthly_income_scaled'] = StandardScaler().fit_transform(
    df_imputed.monthly_income.values.reshape(-1, 1)).ravel()


Modeling and evaluation

df_imputed.columns

Index([u'serious_dlqin2yrs', u'revolving_utilization_of_unsecured_lines',
       u'age', u'number_of_time30-59_days_past_due_not_worse', u'debt_ratio',
       u'monthly_income', u'number_of_open_credit_lines_and_loans',
       u'number_of_times90_days_late', u'number_real_estate_loans_or_lines',
       u'number_of_time60-89_days_past_due_not_worse', u'number_of_dependents',
       u'income_bins', u'age_bin', u'monthly_income_scaled'],
      dtype='object')

# features
features = ['revolving_utilization_of_unsecured_lines',
            'age',
            'number_of_time30-59_days_past_due_not_worse',
            'debt_ratio',
            'monthly_income',
            'number_of_open_credit_lines_and_loans',
            'number_of_times90_days_late',
            'number_real_estate_loans_or_lines',
            'number_of_time60-89_days_past_due_not_worse',
            'number_of_dependents',
            'income_bins',
            'age_bin',
            'monthly_income_scaled']

X = pd.get_dummies(df_imputed[features], columns=['income_bins', 'age_bin'])
# dummy variables

print(X.columns.tolist())
print(X.shape)

['revolving_utilization_of_unsecured_lines', 'age',
 'number_of_time30-59_days_past_due_not_worse', 'debt_ratio', 'monthly_income',
 'number_of_open_credit_lines_and_loans', 'number_of_times90_days_late',
 'number_real_estate_loans_or_lines', 'number_of_time60-89_days_past_due_not_worse',
 'number_of_dependents', 'monthly_income_scaled',
 'income_bins_0', 'income_bins_1', 'income_bins_2', 'income_bins_3', 'income_bins_4',
 'income_bins_5', 'income_bins_6', 'income_bins_7', 'income_bins_8', 'income_bins_9',
 'income_bins_10', 'income_bins_11', 'income_bins_12', 'income_bins_13', 'income_bins_14',
 'age_bin_(0, 20]', 'age_bin_(20, 25]', 'age_bin_(25, 30]', 'age_bin_(30, 35]',
 'age_bin_(35, 40]', 'age_bin_(40, 45]', 'age_bin_(45, 50]', 'age_bin_(50, 55]',
 'age_bin_(55, 60]', 'age_bin_(60, 65]', 'age_bin_(65, 70]', 'age_bin_(70, 75]',
 'age_bin_(75, 120]']
(150000, 39)

y = df_imputed.serious_dlqin2yrs

# sklearn.cross_validation was renamed sklearn.model_selection in scikit-learn 0.18
from sklearn.model_selection import train_test_split
train_X, test_X, train_y, test_y = train_test_split(X, y, train_size=0.7, random_state=1)

print(train_X.shape)
print(test_X.shape)

(105000, 39)
(45000, 39)


from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

clf = LogisticRegression()
clf.fit(train_X, train_y)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

clf.predict_proba(test_X)

array([[ 0.92834155,  0.07165845],
       [ 0.96492419,  0.03507581],
       [ 0.95753197,  0.04246803],
       ...,
       [ 0.96021551,  0.03978449],
       [ 0.87166534,  0.12833466],
       [ 0.9777555 ,  0.0222445 ]])

from sklearn.metrics import roc_curve, roc_auc_score, classification_report, confusion_matrix

preds = clf.predict(test_X)
confusion_matrix(test_y, preds)

array([[41880,    88],
       [ 2908,   124]])

print(classification_report(test_y, preds, labels=[0, 1]))

             precision    recall  f1-score   support

          0       0.94      1.00      0.97     41968
          1       0.58      0.04      0.08      3032

avg / total       0.91      0.93      0.91     45000
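The class-1 row of the report can be reproduced by hand from the confusion matrix, which makes clear why 93% accuracy hides a weak model: it catches only about 4% of the real defaulters. A short check, using the four counts printed above (rows are true classes, columns are predicted classes):

```python
import numpy as np

# Confusion matrix from the logistic regression run: rows = truth, columns = prediction.
cm = np.array([[41880, 88],
               [2908, 124]])
tn, fp = cm[0]
fn, tp = cm[1]

precision = tp / float(tp + fp)   # of the predicted defaulters, how many really default
recall = tp / float(tp + fn)      # of the real defaulters, how many the model catches
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / float(cm.sum())

print("precision=%.2f recall=%.2f f1=%.2f accuracy=%.2f"
      % (precision, recall, f1, accuracy))
# prints: precision=0.58 recall=0.04 f1=0.08 accuracy=0.93
```

With only about 6.7% of borrowers defaulting, predicting "no default" for everyone would already score about 93% accuracy, so recall and AUC are the more informative measures here.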


pre = clf.predict_proba(test_X)
roc_auc_score(test_y, pre[:, 1])

0.7078206950866951

fpr, tpr, thresholds = roc_curve(test_y, pre[:, 1])
plt.plot(fpr, tpr);
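One way to read the AUC: it is the probability that a randomly chosen defaulter receives a higher score than a randomly chosen non-defaulter. The sketch below checks that interpretation against `roc_auc_score` on a handful of invented labels and scores (not the credit data).

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels and scores, for illustration only.
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.9, 0.2, 0.7])

pos = scores[y_true == 1]
neg = scores[y_true == 0]

# Compare every positive with every negative; ties count as half a win.
wins = (pos[:, None] > neg[None, :]).sum() + 0.5 * (pos[:, None] == neg[None, :]).sum()
auc_by_ranking = wins / float(len(pos) * len(neg))

auc_sklearn = roc_auc_score(y_true, scores)
print(auc_by_ranking, auc_sklearn)
```

Because AUC depends only on how the scores rank the cases, it is unaffected by the 0.5 cut-off that `predict` applies, which is why it can look respectable even when recall at that cut-off is very low.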





clf = RandomForestClassifier()
clf.fit(train_X, train_y)
preds = clf.predict(test_X)
print(classification_report(test_y, preds, labels=[0, 1]))

             precision    recall  f1-score   support

          0       0.94      0.99      0.97     41968
          1          ?      0.15      0.24      3032

avg / total       0.91      0.93      0.92     45000

pre = clf.predict_proba(test_X)
roc_auc_score(test_y, pre[:, 1])

0.7867644886115015

Exercise: improve on this result so that the AUC rises above 0.78.

df_imputed.to_csv('df_imputed', index=False)





Sections

• Exploring and visualizing the Housing dataset
• Implementing a simple regression model - Ordinary least squares
  o Solving regression parameters with gradient descent
  o Estimating coefficient of a regression model via scikit-learn
• Fitting a robust regression model using RANSAC
• Evaluating the performance of linear regression models
• Turning a linear regression model into a curve - Polynomial regression
• Modeling nonlinear relationships in the Housing dataset

Exploring and visualizing the Housing dataset

Boston house-price data

Source: https://archive.ics.uci.edu/ml/datasets/Housing

Attributes:

1. CRIM      per capita crime rate by town
2. ZN        proportion of residential land zoned for lots over 25,000 sq.ft.
3. INDUS     proportion of non-retail business acres per town
4. CHAS      Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
5. NOX       nitric oxides concentration (parts per 10 million)
6. RM        average number of rooms per dwelling
7. AGE       proportion of owner-occupied units built prior to 1940
8. DIS       weighted distances to five Boston employment centres
9. RAD       index of accessibility to radial highways
10. TAX      full-value property-tax rate per $10,000
11. PTRATIO  pupil-teacher ratio by town
12. B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
13. LSTAT    % lower status of the population
14. MEDV     Median value of owner-occupied homes in $1000's (the variable to predict)


import pandas as pd
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data',
                 header=None, sep=r'\s+')

df.columns = ['CRIM', 'ZN', 'INDUS', 'CHAS',
              'NOX', 'RM', 'AGE', 'DIS', 'RAD',
              'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
df.head()

      CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS
0  0.00632  18.0   2.31     0  0.538  6.575  65.2  4.0900
1  0.02731   0.0   7.07     0  0.469  6.421  78.9  4.9671
2  0.02729   0.0   7.07     0  0.469  7.185  61.1  4.9671
3  0.03237   0.0   2.18     0  0.458  6.998  45.8  6.0622
4  0.06905   0.0   2.18     0  0.458  7.147  54.2  6.0622

The first step in data analysis is exploratory data analysis (EDA): getting to know the distribution of each variable and the relationships between variables.

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='whitegrid', context='notebook')

cols = ['LSTAT', 'INDUS', 'NOX', 'RM', 'MEDV']

# note: `size` was renamed `height` in seaborn 0.9
sns.pairplot(df[cols], size=2.5)
plt.tight_layout()

Regression models and house-price prediction

From the plots we can see:

• RM and MEDV appear to be linearly related
• MEDV looks roughly normally distributed

# correlation map
import numpy as np
cm = np.corrcoef(df[cols].values.T)  # compute the correlation coefficients
sns.set(font_scale=1.5)

# draw the correlation matrix as a heatmap
hm = sns.heatmap(cm,
                 annot=True,
                 square=True,
                 fmt='.2f',
                 annot_kws={'size': 15},
                 yticklabels=cols,
                 xticklabels=cols)
plt.tight_layout()
# plt.savefig('./figures/corr_mat.png', dpi=300)

[Figure: correlation heatmap of LSTAT, INDUS, NOX, RM and MEDV]

• We are interested in the variables most strongly correlated with MEDV: LSTAT is highest (-0.74), followed by RM (0.70).
• The earlier plots show, however, that MEDV is related to LSTAT nonlinearly and to RM more nearly linearly, so RM is used below to demonstrate simple linear regression.

sns.reset_orig()

Implementing a simple regression model - Ordinary least squares


Solving regression parameters with gradient descent

Gradient descent

Gradient descent is an optimization algorithm, often also called the method of steepest descent. It is one of the simplest and oldest methods for solving unconstrained optimization problems. In its plain form it is seldom the most efficient choice today, but many effective algorithms are refinements and modifications built on it. Steepest descent searches along the negative gradient direction; with a fixed learning rate, the closer it gets to the target the smaller its steps become and the slower it advances.

What is gradient descent?

• Gradient descent can serve as one way of solving a least-squares problem. It is one of the older methods in optimization.
• Starting from an initial point, gradient descent takes the negative gradient direction (the direction in which the function value decreases) as its search direction and looks for the minimum. The closer it gets to the target, the smaller its steps and the slower its progress.

class LinearRegressionGD(object):

    def __init__(self, eta=0.001, n_iter=20):
        self.eta = eta        # learning rate
        self.n_iter = n_iter  # number of iterations

    def fit(self, X, y):  # training function
        # self.w_ = np.zeros(1, 1 + X.shape[1])
        self.coef_ = np.zeros(shape=(1, X.shape[1]))  # coefficients being trained, initialized to 0
        self.intercept_ = np.zeros(1)
        self.cost_ = []  # empty list for storing the loss

        for i in range(self.n_iter):
            output = self.net_input(X)  # compute the predicted y
            errors = y - output
            # update the coefficients by the update rule; think about why this is not a minus sign
            self.coef_ += self.eta * np.dot(errors.T, X)
            # update the bias, which is like a feature X fixed at the constant 1
            self.intercept_ += self.eta * errors.sum()
            cost = (errors**2).sum() / 2.0  # compute the loss
            self.cost_.append(cost)  # record the value of the loss function
        return self

    def net_input(self, X):  # given the coefficients and X, compute the predicted y
        return np.dot(X, self.coef_.T) + self.intercept_

    def predict(self, X):
        return self.net_input(X)
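The question raised in the `fit` comment (why the update uses `+=` rather than `-=`) is settled by differentiating the loss that `fit` records. A short derivation follows; the notation is introduced here for illustration, with $\eta$ standing for `eta`, $w$ for `coef_`, $b$ for `intercept_`, and $\hat{y}^{(i)}$ for the output of `net_input` on sample $i$.

```latex
J(w, b) = \frac{1}{2} \sum_{i=1}^{n} \bigl(y^{(i)} - \hat{y}^{(i)}\bigr)^2,
\qquad \hat{y}^{(i)} = w^\top x^{(i)} + b

\frac{\partial J}{\partial w_j} = -\sum_{i=1}^{n} \bigl(y^{(i)} - \hat{y}^{(i)}\bigr)\, x_j^{(i)},
\qquad
\frac{\partial J}{\partial b} = -\sum_{i=1}^{n} \bigl(y^{(i)} - \hat{y}^{(i)}\bigr)

w_j \leftarrow w_j - \eta \frac{\partial J}{\partial w_j}
    = w_j + \eta \sum_{i=1}^{n} \bigl(y^{(i)} - \hat{y}^{(i)}\bigr)\, x_j^{(i)},
\qquad
b \leftarrow b - \eta \frac{\partial J}{\partial b}
    = b + \eta \sum_{i=1}^{n} \bigl(y^{(i)} - \hat{y}^{(i)}\bigr)
```

Gradient descent does subtract the gradient; the minus sign is simply absorbed because `errors` is defined as `y - output`, so the gradient itself carries a leading minus.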


# use RM as the explanatory variable
X = df[['RM']].values
y = df[['MEDV']].values

# standardize
from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
sc_y = StandardScaler()
X_std = sc_x.fit_transform(X)
y_std = sc_y.fit_transform(y)

lr = LinearRegressionGD()
lr.fit(X_std, y_std);  # train on the standardized data

# cost function
plt.plot(range(1, lr.n_iter+1), lr.cost_)
plt.ylabel('SSE')
plt.xlabel('Epoch')
plt.tight_layout()

After about epoch 5 the cost essentially stops decreasing.

# define a plotting helper for displaying the fit
def lin_regplot(X, y, model):
    plt.scatter(X, y, c='lightblue')
    plt.plot(X, model.predict(X), color='red', linewidth=2)
    return None

# plot the predictions
lin_regplot(X_std, y_std, lr)
plt.xlabel('Average number of rooms [RM] (standardized)')
plt.ylabel('Price in $1000\'s [MEDV] (standardized)')
plt.tight_layout()
plt.show()

[Figure: scatter of standardized MEDV against standardized RM with the fitted line]

print('Slope: %.3f' % lr.coef_[0])
print('Intercept: %.3f' % lr.intercept_)
# slope and intercept of the line

Slope: 0.695
Intercept: -0.000

num_rooms_std = sc_x.transform([[5.0]])
price_std = lr.predict(num_rooms_std)
print("Price in $1000's: %.3f" % sc_y.inverse_transform(price_std))

Price in $1000's: 10.840


Estimating coefficient of a regression model via scikit-learn

from sklearn.linear_model import LinearRegression

slr = LinearRegression()
slr.fit(X_std, y_std)
print('Slope: %.3f' % slr.coef_[0])
print('Intercept: %.3f' % slr.intercept_)

Slope: 0.695
Intercept: -0.000

lin_regplot(X_std, y_std, slr)
plt.xlabel('Average number of rooms [RM] (standardized)')
plt.ylabel('Price in $1000\'s [MEDV] (standardized)')
plt.tight_layout()

[Figure: scatter of standardized MEDV against standardized RM with the scikit-learn fit]

# without standardizing: regress directly on the raw data
slr.fit(X, y)
lin_regplot(X, y, slr)
plt.xlabel('Average number of rooms [RM]')
plt.ylabel('Price in $1000\'s [MEDV]')
plt.tight_layout()

[Figure: scatter of MEDV against RM with the scikit-learn fit on the raw data]

The result is close to the one from gradient descent. Think about when standardization is needed.
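One answer to that question: the closed-form least-squares solver does not care about the scale of the inputs, but gradient descent with a fixed learning rate does. The sketch below runs the same batch update rule on raw and on standardized versions of an invented one-feature dataset (scaled loosely like RM and MEDV, but not the Boston data); `gd_final_cost` is a hypothetical helper written for this illustration.

```python
import numpy as np

def gd_final_cost(X, y, eta, n_iter=20):
    """Batch gradient descent for one-feature least squares; returns the last SSE/2."""
    w, b = 0.0, 0.0
    cost = None
    with np.errstate(over="ignore", invalid="ignore"):  # a diverging run may overflow
        for _ in range(n_iter):
            errors = y - (w * X + b)
            w += eta * np.dot(errors, X)
            b += eta * errors.sum()
            cost = (errors ** 2).sum() / 2.0
    return cost

rng = np.random.RandomState(0)
# Made-up data: a feature centred near 6 and a target that depends on it linearly.
X_raw = rng.normal(loc=6.0, scale=0.7, size=200)
y_raw = 9.0 * X_raw - 35.0 + rng.normal(scale=3.0, size=200)

X_std = (X_raw - X_raw.mean()) / X_raw.std()
y_std = (y_raw - y_raw.mean()) / y_raw.std()

cost_raw = gd_final_cost(X_raw, y_raw, eta=0.001)
cost_std = gd_final_cost(X_std, y_std, eta=0.001)
print("raw data:          final cost = %g" % cost_raw)
print("standardized data: final cost = %g" % cost_std)
```

On the raw data the same learning rate overshoots and the cost explodes, while on the standardized data it settles within a few epochs. A smaller learning rate would also rescue the raw run, at the price of many more iterations.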


Fitting a robust regression model using RANSAC

Linear regression is sensitive to outliers, but deciding which points to remove as outliers is a judgment call you have to make yourself. An alternative is RANdom SAmple Consensus (RANSAC).

The algorithm is roughly as follows:

1. Select a random number of samples to be inliers and fit the model.
2. Test all other data points against the fitted model and add those points that fall within a user-given tolerance to the inliers.
3. Refit the model using all inliers.
4. Estimate the error of the fitted model versus the inliers.
5. Terminate the algorithm if the performance meets a certain user-defined threshold or if a fixed number of iterations has been reached; go back to step 1 otherwise.
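The loop can be sketched in plain NumPy to make the steps concrete. This is a simplified illustration, not scikit-learn's implementation: `ransac_line` is a hypothetical helper, the data are invented, and steps 4-5 are reduced to "keep the fit with the most inliers until the trial budget runs out".

```python
import numpy as np

def ransac_line(x, y, n_trials=100, min_samples=10, threshold=5.0, seed=0):
    """Fit y = slope * x + intercept with a bare-bones RANSAC loop."""
    rng = np.random.RandomState(seed)
    best_mask, best_fit = None, None
    for _ in range(n_trials):
        # Step 1: fit the model to a random subset.
        subset = rng.choice(len(x), size=min_samples, replace=False)
        slope, intercept = np.polyfit(x[subset], y[subset], deg=1)
        # Step 2: points within the tolerance of that line count as inliers.
        mask = np.abs(y - (slope * x + intercept)) < threshold
        if mask.sum() < 2:
            continue
        # Step 3: refit using all inliers.
        fit = np.polyfit(x[mask], y[mask], deg=1)
        # Steps 4-5 (simplified): keep the fit with the most inliers.
        if best_mask is None or mask.sum() > best_mask.sum():
            best_mask, best_fit = mask, fit
    return best_fit, best_mask

rng = np.random.RandomState(1)
x = rng.uniform(0, 10, size=200)
y = 3.0 * x + 2.0 + rng.normal(scale=1.0, size=200)  # made-up line with noise
y[:20] += 40.0                                        # made-up outliers

(slope, intercept), inliers = ransac_line(x, y)
ols_slope, ols_intercept = np.polyfit(x, y, deg=1)
print("RANSAC: slope=%.2f intercept=%.2f, %d inliers" % (slope, intercept, inliers.sum()))
print("OLS:    slope=%.2f intercept=%.2f" % (ols_slope, ols_intercept))
```

The RANSAC fit should land close to the generating line (slope 3, intercept 2) and exclude the 20 shifted points, whereas ordinary least squares is pulled toward them.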


# use the RANSAC implementation in sklearn
from sklearn.linear_model import RANSACRegressor
ransac = RANSACRegressor(LinearRegression(),
                         max_trials=100,          # max iteration
                         min_samples=50,          # min number of randomly chosen samples
                         # residuals are measured as absolute vertical distances, which is the
                         # default (older scikit-learn versions passed it via residual_metric)
                         residual_threshold=5.0,  # allow sample as inlier within 5 distance units
                         random_state=0)
ransac.fit(X, y)

# separate inliers and outliers
inlier_mask = ransac.inlier_mask_
outlier_mask = np.logical_not(inlier_mask)

line_X = np.arange(3, 10, 1)
line_y_ransac = ransac.predict(line_X[:, np.newaxis])
plt.scatter(X[inlier_mask], y[inlier_mask], c='blue', marker='o', label='Inliers')
plt.scatter(X[outlier_mask], y[outlier_mask], c='lightgreen', marker='s', label='Outliers')
plt.plot(line_X, line_y_ransac, color='red')
plt.xlabel('Average number of rooms [RM]')
plt.ylabel('Price in $1000\'s [MEDV]')
plt.legend(loc='upper left')
plt.tight_layout()

print('Slope: %.3f' % ransac.estimator_.coef_[0])
print('Intercept: %.3f' % ransac.estimator_.intercept_)

Slope: 9.621
Intercept: -37.137

RANSAC reduces the influence of outliers, but whether it affects predictive power on unseen data is unknown.

Comparing RANSAC regression with OLS regression


from sklearn import datasets 


n_samples = 1000 
n_outliers = 50 


X, y, coef = datasets.make_regression(n_samples=n_samples, n_features=1,
                                      n_informative=1, noise=10,
                                      coef=True, random_state=0)

# Add outlier data
np.random.seed(0)
X[:n_outliers] = 3 + 0.5 * np.random.normal(size=(n_outliers, 1))
y[:n_outliers] = -3 + 10 * np.random.normal(size=n_outliers)

# Fit line using all data
model = LinearRegression()
model.fit(X, y)

# Robustly fit linear model with RANSAC algorithm
model_ransac = RANSACRegressor(LinearRegression())
model_ransac.fit(X, y)
inlier_mask = model_ransac.inlier_mask_
outlier_mask = np.logical_not(inlier_mask)

# Predict data of estimated models
line_X = np.arange(-5, 5)
line_y = model.predict(line_X[:, np.newaxis])
line_y_ransac = model_ransac.predict(line_X[:, np.newaxis])

# Compare estimated coefficients
print("Estimated coefficients (true, normal, RANSAC):")
print(coef, model.coef_, model_ransac.estimator_.coef_)

plt.plot(X[inlier_mask], y[inlier_mask], '.g', label='Inliers')
plt.plot(X[outlier_mask], y[outlier_mask], '.r', label='Outliers')
plt.plot(line_X, line_y, '-k', label='Linear regressor')
plt.plot(line_X, line_y_ransac, '-b', label='RANSAC regressor')
plt.legend(loc='lower right');


Estimated coefficients (true, normal, RANSAC): 
(array(82.1903908407869), array([ 54.17236387]), array([ 82.08533159]))
Evaluating the performance of linear regression models


It is crucial to test the model on data that it hasn't seen during training to obtain an 
unbiased estimate of its performance. 




from sklearn.cross_validation import train_test_split

X = df.iloc[:, :-1].values
y = df['MEDV'].values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
slr = LinearRegression()

slr.fit(X_train, y_train)
y_train_pred = slr.predict(X_train)
y_test_pred = slr.predict(X_test)

# residual plots are commonly used to diagnose regression models
plt.scatter(y_train_pred, y_train_pred - y_train, c='blue', marker='o', label='Training data')
plt.scatter(y_test_pred, y_test_pred - y_test, c='lightgreen', marker='s', label='Test data')
plt.xlabel('Predicted values')
plt.ylabel('Residuals')
plt.legend(loc='upper left')
plt.hlines(y=0, xmin=-10, xmax=50, lw=2, color='red')
plt.xlim([-10, 50])
plt.tight_layout()

# plt.savefig('./figures/slr_residuals.png', dpi=300)
If every prediction were exactly right, all residuals would be 0. That is the ideal case; in practice we want the errors to be randomly distributed around 0.


In the plot above, some errors lie far from the red line. These large deviations may be caused by outliers.


from sklearn.metrics import r2_score 

from sklearn.metrics import mean_squared_error 

print('MSE train: %.3f, test: %.3f' % ( 
mean_squared_error(y_train, y_train_pred), 
mean_squared_error(y_test, y_test_pred))) 

print('R^2 train: %.3f, test: %.3f' % ( 
r2_score(y_train, y_train_pred), 
r2_score(y_test, y_test_pred))) 


MSE train: 19.958, test: 27.196 
R^2 train: 0.765, test: 0.673


$$MSE = \frac{1}{n}\sum_{i=1}^{n}\left(y^{(i)} - \hat{y}^{(i)}\right)^2, \qquad R^2 = 1 - \frac{SSE}{SST} = 1 - \frac{MSE}{\mathrm{Var}(y)}$$
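To make the two metrics concrete, here is a minimal dependency-free sketch that computes MSE and R^2 by hand. The helper names `mse_by_hand` and `r2_by_hand` and the four sample values are made up for illustration; in the notebook itself the scikit-learn functions `mean_squared_error` and `r2_score` do this work.

```python
def mse_by_hand(y_true, y_pred):
    """Mean of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_by_hand(y_true, y_pred):
    """1 - SSE / SST, where SST is the squared deviation from the mean of y."""
    mean_y = sum(y_true) / len(y_true)
    sse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    sst = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - sse / sst


# hypothetical targets and predictions
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]

mse = mse_by_hand(y_true, y_pred)  # (0.01 + 0.01 + 0.04 + 0.04) / 4
r2 = r2_by_hand(y_true, y_pred)    # 1 - 0.1 / 5
print(mse, r2)
```

Because R^2 rescales the MSE by the variance of the target, it is easier to compare across datasets than the raw MSE.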





Turning a linear regression model into a 
curve - Polynomial regression 
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([258.0, 270.0, 294.0,
              320.0, 342.0, 368.0,
              396.0, 446.0, 480.0, 586.0])[:, np.newaxis]

y = np.array([236.4, 234.4, 252.8,
              298.6, 314.2, 342.2,
              360.8, 368.0, 391.2,
              390.8])
# add the quadratic term and the intercept term
from sklearn.preprocessing import PolynomialFeatures

lr = LinearRegression()
pr = LinearRegression()
quadratic = PolynomialFeatures(degree=2)


X_quad = quadratic.fit_transform(X) 


print(X.shape) 
print(X_quad.shape) 


(10, 1) 
(10, 3) 


X_quad 


array([[ 1.00000000e+00,  2.58000000e+02,  6.65640000e+04],
       [ 1.00000000e+00,  2.70000000e+02,  7.29000000e+04],
       [ 1.00000000e+00,  2.94000000e+02,  8.64360000e+04],
       [ 1.00000000e+00,  3.20000000e+02,  1.02400000e+05],
       [ 1.00000000e+00,  3.42000000e+02,  1.16964000e+05],
       [ 1.00000000e+00,  3.68000000e+02,  1.35424000e+05],
       [ 1.00000000e+00,  3.96000000e+02,  1.56816000e+05],
       [ 1.00000000e+00,  4.46000000e+02,  1.98916000e+05],
       [ 1.00000000e+00,  4.80000000e+02,  2.30400000e+05],
       [ 1.00000000e+00,  5.86000000e+02,  3.43396000e+05]])
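The three columns are simply x^0, x^1 and x^2 for each sample. A minimal sketch of that expansion for a single feature follows; `polynomial_features` is a hypothetical helper written for this illustration, not the scikit-learn class, and it only handles one input column.

```python
def polynomial_features(column, degree=2):
    """Expand each value x into [x**0, x**1, ..., x**degree]."""
    return [[x ** d for d in range(degree + 1)] for x in column]


# first two samples of the example data
rows = polynomial_features([258.0, 270.0], degree=2)
print(rows)
```

A linear regression on these expanded columns is still linear in its coefficients, which is why an ordinary `LinearRegression` can fit the curve.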


lr.fit(X, y)
X_fit = np.arange(250, 600, 10)[:, np.newaxis]
y_lin_fit = lr.predict(X_fit)

pr.fit(X_quad, y)
y_quad_fit = pr.predict(quadratic.fit_transform(X_fit))

plt.scatter(X, y, label='training points')
plt.plot(X_fit, y_lin_fit, label='linear fit', linestyle='--')
plt.plot(X_fit, y_quad_fit, label='quadratic fit')
plt.legend(loc='upper left')

plt.tight_layout()
The plot shows that the quadratic fit follows the data better than the linear fit.


y_lin_pred = lr.predict(X)
y_quad_pred = pr.predict(X_quad)

print('Training MSE linear: %.3f, quadratic: %.3f' % (
        mean_squared_error(y, y_lin_pred),
        mean_squared_error(y, y_quad_pred)))
print('Training R^2 linear: %.3f, quadratic: %.3f' % (
        r2_score(y, y_lin_pred),
        r2_score(y, y_quad_pred)))


Training MSE linear: 569.780, quadratic: 61.330 
Training R^2 linear: 0.832, quadratic: 0.982 


The MSE drops to 61 and R^2 rises to 98%, which shows that the quadratic fit works better on this dataset.


Modeling nonlinear relationships in the 
Housing dataset 


We fit quadratic and cubic polynomials to the relationship between house prices and LSTAT and compare them with the linear fit.
X = df[['LSTAT']].values
y = df['MEDV'].values

regr = LinearRegression()

# create polynomial features
quadratic = PolynomialFeatures(degree=2)
cubic = PolynomialFeatures(degree=3)
X_quad = quadratic.fit_transform(X)
X_cubic = cubic.fit_transform(X)

# linear fit
X_fit = np.arange(X.min(), X.max(), 1)[:, np.newaxis]

regr = regr.fit(X, y)
y_lin_fit = regr.predict(X_fit)
linear_r2 = r2_score(y, regr.predict(X))

# quadratic fit
regr = regr.fit(X_quad, y)
y_quad_fit = regr.predict(quadratic.fit_transform(X_fit))
quadratic_r2 = r2_score(y, regr.predict(X_quad))

# cubic fit
regr = regr.fit(X_cubic, y)
y_cubic_fit = regr.predict(cubic.fit_transform(X_fit))
cubic_r2 = r2_score(y, regr.predict(X_cubic))

# plot results
plt.scatter(X, y, label='training points', color='lightgray')

plt.plot(X_fit, y_lin_fit,
         label='linear (d=1), $R^2=%.2f$' % linear_r2,
         color='blue',
         lw=2,
         linestyle=':')

plt.plot(X_fit, y_quad_fit,
         label='quadratic (d=2), $R^2=%.2f$' % quadratic_r2,
         color='red',
         lw=2,
         linestyle='-')

plt.plot(X_fit, y_cubic_fit,
         label='cubic (d=3), $R^2=%.2f$' % cubic_r2,
         color='green',
         lw=2,
         linestyle='--')

plt.xlabel('% lower status of the population [LSTAT]')
plt.ylabel('Price in $1000\'s [MEDV]')
plt.legend(loc='upper right')

plt.tight_layout()
# plt.savefig('./figures/polyhouse_example.png', dpi=300)


Transforming the dataset with a log: why do this? Is it because the exploratory plots suggested it?


X = df[['LSTAT']].values
y = df['MEDV'].values

# transform features
X_log = np.log(X)
y_sqrt = np.sqrt(y)

# fit features
X_fit = np.arange(X_log.min()-1, X_log.max()+1, 1)[:, np.newaxis]

regr = regr.fit(X_log, y_sqrt)
y_lin_fit = regr.predict(X_fit)
linear_r2 = r2_score(y_sqrt, regr.predict(X_log))

# plot results
plt.scatter(X_log, y_sqrt, label='training points', color='lightgray')

plt.plot(X_fit, y_lin_fit,
         label='linear (d=1), $R^2=%.2f$' % linear_r2,
         color='blue',
         lw=2)

plt.xlabel('log(% lower status of the population [LSTAT])')
plt.ylabel('$\sqrt{Price \; in \; \$1000\'s [MEDV]}$')
plt.legend(loc='lower left')

plt.tight_layout()
# plt.savefig('./figures/transform_example.png', dpi=300)
plt.show()
After the log transform, a plain linear fit already works well, better than the polynomial fits alone.


Exercise: use the other independent variables in the housing data together to build a multiple regression model and check whether R^2 improves.
Sections


• Implementing a perceptron learning algorithm in Python
  ◦ Training a perceptron model on the Iris dataset
• Adaptive linear neurons and the convergence of learning
  ◦ Implementing an adaptive linear neuron in Python
• Implementing logistic regression in Python
• Classification with scikit-learn
  ◦ Loading and preprocessing the data
  ◦ Other Available Data
  ◦ Training a perceptron via scikit-learn
  ◦ Modeling class probabilities via logistic regression
  ◦ Maximum margin classification with support vector machines
  ◦ Solving non-linear problems using a kernel SVM
  ◦ K-nearest neighbors - a lazy learning algorithm
• Scoring metrics for classification
  ◦ Classification metrics in Scikit-learn
  ◦ Reading a confusion matrix
  ◦ Precision, recall and F-measures
  ◦ ROC and AUC
  ◦ Hinge loss
  ◦ Logloss


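What is perceptron classification


The perceptron is the simplest form of feedforward neural network. It is a binary linear classifier that maps an input $x$ (a vector) to an output value $f(x)$ (a single binary value):


$$f(x) = \begin{cases} 1 & \text{if } w \cdot x + b > 0 \\ -1 & \text{otherwise} \end{cases}$$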
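We first define some variables:

• $x(j)$ denotes the $j$-th entry of the $n$-dimensional input vector
• $w(j)$ denotes the $j$-th entry of the weight vector
• $f(x)$ denotes the output the neuron produces for input $x$
• $\alpha$ is a constant with $0 < \alpha \le 1$ (the learning rate)
• For simplicity we further assume that the bias $b$ equals 0, because an extra dimension $n+1$ can be appended to the input vector in the form $x(n+1) = 1$, so that $w(n+1)$ takes the place of the bias.

The perceptron learns by iterating over all training examples several times and updating the weights. Let $D_m = \{(x_1, y_1), \dots, (x_m, y_m)\}$ denote a training set with $m$ training examples.


In each iteration the weight vector is updated as follows: for every pair $(x, y)$ in $D_m$,

$$w(j) := w(j) + \alpha\,(y - f(x))\,x(j) \qquad (j = 1, \dots, n)$$


Note that this means the weight vector changes only when the output $f(x)$ produced for a given training example $(x, y)$ differs from the expected output $y$.


If there exist a positive constant $\gamma$ and a weight vector $w$ such that $y_i \cdot (\langle w, x_i \rangle + b) > \gamma$ for all $i$, the training set $D_m$ is called linearly separable. If the training set is not linearly separable, however, the algorithm is not guaranteed to converge.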
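A single application of the update rule can be traced by hand. The sketch below is a dependency-free illustration with made-up weights and one made-up sample; the helper names `predict` and `update` are assumptions for this example, and the first weight plays the role of the bias as in the class defined in the next section.

```python
def predict(w, x):
    """Unit step on the net input; w[0] plays the role of the bias."""
    net = w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))
    return 1 if net >= 0.0 else -1


def update(w, x, y, eta=0.1):
    """Return the weights after applying the learning rule to one sample."""
    delta = eta * (y - predict(w, x))
    return [w[0] + delta] + [wj + delta * xj for wj, xj in zip(w[1:], x)]


w0 = [0.0, 0.0, 0.0]
w1 = update(w0, x=(1.0, 2.0), y=-1)  # misclassified (predicted 1), so weights move
w2 = update(w1, x=(1.0, 2.0), y=-1)  # now classified correctly, so nothing changes
print(w1, w2)
```

The second call leaves the weights untouched because `y - predict(w, x)` is zero, which is exactly the "only changes on a mistake" property noted above.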


Implementing a perceptron learning 
algorithm in Python 


import numpy as np 


class Perceptron(object):
    """Perceptron classifier.

    Parameters
    ------------
    eta : float
        Learning rate (between 0.0 and 1.0)
    n_iter : int
        Passes over the training dataset.

    Attributes
    -----------
    w_ : 1d-array
        Weights after fitting.
    errors_ : list
        Number of misclassifications in every epoch.

    """
    def __init__(self, eta=0.01, n_iter=10):
        self.eta = eta
        self.n_iter = n_iter  # the number of epochs

    def fit(self, X, y):
        """Fit training data.

        Parameters
        ----------
        X : {array-like}, shape = [n_samples, n_features]
            Training vectors, where n_samples is the number of samples and
            n_features is the number of features.
        y : array-like, shape = [n_samples]
            Target values.

        Returns
        -------
        self : object

        """
        self.w_ = np.zeros(1 + X.shape[1])  # weights, initialized to 0
        self.errors_ = []

        # loop over every sample and update the weights
        for _ in range(self.n_iter):
            errors = 0
            for xi, target in zip(X, y):
                update = self.eta * (target - self.predict(xi))  # learning rate * error
                self.w_[1:] += update * xi
                self.w_[0] += update
                errors += int(update != 0.0)
            self.errors_.append(errors)  # number of misclassifications in this epoch
        return self

    def net_input(self, X):
        """Calculate net input w*x"""
        return np.dot(X, self.w_[1:]) + self.w_[0]

    def predict(self, X):
        """Return class label after unit step"""
        return np.where(self.net_input(X) >= 0.0, 1, -1)
Training a perceptron model on the Iris dataset


Here we consider only two flower classes, Setosa and Versicolor, and two features, sepal length and petal length.


The perceptron model can nevertheless be extended to multi-class classification; see the one-vs-all technique.




Reading-in the Iris data 


import pandas as pd 


df = pd.read_csv('data/iris.csv', header=None) 
df.tail() 
Plotting the Iris data


# visualize the two classes first

%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np

# select setosa and versicolor:
# take 50 samples of each and relabel the classes as -1 and 1
y = df.iloc[0:100, 4].values
y = np.where(y == 'Iris-setosa', -1, 1)

# extract sepal length and petal length
X = df.iloc[0:100, [0, 2]].values

# plot data
plt.scatter(X[:50, 0], X[:50, 1],
            color='red', marker='o', label='setosa')
plt.scatter(X[50:100, 0], X[50:100, 1],
            color='blue', marker='x', label='versicolor')

plt.xlabel('sepal length [cm]')
plt.ylabel('petal length [cm]')
plt.legend(loc='upper left')

plt.tight_layout()
# plt.savefig('./iris_1.png', dpi=300)
Training the perceptron model


ppn = Perceptron(eta=0.1, n_iter=10) 


ppn.fit(X, y) 
ppn.errors_ 


[2, 2, 3, 2, 1, 0, 0, 0, 0, 0] 
plt.plot(range(1, len(ppn.errors_) + 1), ppn.errors_, marker='o')
plt.xlabel('Epochs')
plt.ylabel('Number of misclassifications')

plt.tight_layout()
[Figure: number of misclassifications per epoch]


The error eventually drops to 0, which shows that the algorithm converged and that the classification is very accurate.


A function for plotting decision regions


This function plots the decision regions of a classifier.


from matplotlib.colors import ListedColormap


# ListedColormap: colormap object generated from a list of colors.
def plot_decision_regions(X, y, classifier, resolution=0.02):

    # setup marker generator and color map
    markers = ('s', 'x', 'o', '^', 'v')
    colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
    cmap = ListedColormap(colors[:len(np.unique(y))])

    # plot the decision surface; pad the feature range by 1 on each side
    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1

    # create a pair of grid arrays
    # flatten the grid arrays then predict
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution),
                           np.arange(x2_min, x2_max, resolution))
    Z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
    Z = Z.reshape(xx1.shape)

    # maps the different decision regions to different colors
    plt.contourf(xx1, xx2, Z, alpha=0.4, cmap=cmap)
    plt.xlim(xx1.min(), xx1.max())
    plt.ylim(xx2.min(), xx2.max())

    # plot class samples
    for idx, cl in enumerate(np.unique(y)):
        plt.scatter(x=X[y == cl, 0], y=X[y == cl, 1],
                    alpha=0.8, c=cmap(idx),
                    marker=markers[idx], label=cl)
plot_decision_regions(X, y, classifier=ppn)
plt.xlabel('sepal length [cm]')
plt.ylabel('petal length [cm]')
plt.legend(loc='upper left')

plt.tight_layout()
# plt.savefig('./perceptron_2.png', dpi=300)
Although the perceptron model performs well on this Iris example, it does not necessarily perform well on other problems. Frank Rosenblatt proved mathematically that the perceptron learning rule converges when the data are linearly separable, but it cannot converge when they are not.


Adaptive linear neurons and the 
convergence of learning (Adaline) 




Implementing an adaptive linear neuron in 
Python 


The ADAptive LInear NEuron (Adaline) classifier is also a single-layer neural network. Its key idea is to define and minimize a cost function, which makes it a good introduction to more advanced machine learning classification models.


It differs from the perceptron in that the weights are updated using a linear activation function rather than a unit step function.


In Adaline this linear activation function simply returns its input: $\phi(w^T x) = w^T x$.


A quantizer placed after the activation turns the continuous output into class labels, while the weights are learned from the continuous activation output.


The cost function is defined as the SSE (Sum of Squared Errors):


$$J(w) = \frac{1}{2}\sum_i \left(y^{(i)} - \phi(z^{(i)})\right)^2$$

This function is differentiable and convex, so it can be minimized with gradient descent.
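For reference, the gradient descent update used by the class below can be derived from the SSE cost in a few steps. The symbols follow the text ($J$, $\phi$, $z^{(i)} = w^T x^{(i)}$); writing the learning rate as $\eta$ is an assumption chosen to match the `eta` parameter in the code.

```latex
\begin{aligned}
\frac{\partial J}{\partial w_j}
  &= \frac{\partial}{\partial w_j}\,\frac{1}{2}\sum_i \left(y^{(i)} - \phi(z^{(i)})\right)^2 \\
  &= \sum_i \left(y^{(i)} - \phi(z^{(i)})\right)
     \frac{\partial}{\partial w_j}\left(y^{(i)} - \sum_k w_k x_k^{(i)}\right) \\
  &= -\sum_i \left(y^{(i)} - \phi(z^{(i)})\right) x_j^{(i)}, \\[4pt]
\Delta w_j &= -\eta\,\frac{\partial J}{\partial w_j}
  = \eta \sum_i \left(y^{(i)} - \phi(z^{(i)})\right) x_j^{(i)}.
\end{aligned}
```

The last line is the batch update: every weight moves by the learning rate times the sum of the per-sample errors weighted by the corresponding feature.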


class AdalineGD(object):
    """ADAptive LInear NEuron classifier.

    Parameters
    ------------
    eta : float
        Learning rate (between 0.0 and 1.0)
    n_iter : int
        Passes over the training dataset.

    Attributes
    -----------
    w_ : 1d-array
        Weights after fitting.
    cost_ : list
        Sum-of-squares cost in every epoch.

    """
    def __init__(self, eta=0.01, n_iter=50):
        self.eta = eta
        self.n_iter = n_iter

    def fit(self, X, y):
        """ Fit training data.

        Parameters
        ----------
        X : {array-like}, shape = [n_samples, n_features]
            Training vectors, where n_samples is the number of samples and
            n_features is the number of features.
        y : array-like, shape = [n_samples]
            Target values.

        Returns
        -------
        self : object

        """
        self.w_ = np.zeros(1 + X.shape[1])
        self.cost_ = []

        # gradient descent
        for i in range(self.n_iter):
            output = self.net_input(X)
            errors = (y - output)
            self.w_[1:] += self.eta * X.T.dot(errors)
            self.w_[0] += self.eta * errors.sum()
            cost = (errors**2).sum() / 2.0
            self.cost_.append(cost)  # cost list, to check algorithm convergence
        return self

    def net_input(self, X):
        """Calculate net input"""
        return np.dot(X, self.w_[1:]) + self.w_[0]

    def activation(self, X):
        """Compute linear activation"""
        return self.net_input(X)

    def predict(self, X):
        """Return class label after unit step"""
        return np.where(self.activation(X) >= 0.0, 1, -1)


# test two learning rates, 0.01 and 0.0001
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(8, 4))

ada1 = AdalineGD(n_iter=10, eta=0.01).fit(X, y)
ax[0].plot(range(1, len(ada1.cost_) + 1), np.log10(ada1.cost_), marker='o')
ax[0].set_xlabel('Epochs')
ax[0].set_ylabel('log(Sum-squared-error)')
ax[0].set_title('Adaline - Learning rate 0.01')

ada2 = AdalineGD(n_iter=10, eta=0.0001).fit(X, y)
ax[1].plot(range(1, len(ada2.cost_) + 1), ada2.cost_, marker='o')
ax[1].set_xlabel('Epochs')
ax[1].set_ylabel('Sum-squared-error')
ax[1].set_title('Adaline - Learning rate 0.0001')

plt.tight_layout()
# plt.savefig('./adaline_1.png', dpi=300)


[Figure: left panel "Adaline - Learning rate 0.01", right panel "Adaline - Learning rate 0.0001"]
The left plot shows a learning rate that is too large: the error grows with every epoch and the algorithm does not converge.


The right plot shows a learning rate that is too small: the error decreases, but far too slowly.
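Both failure modes can be reproduced on a one-dimensional toy problem. The sketch below is an illustration under simplifying assumptions, not the Adaline cost itself: it minimizes J(w) = w^2 / 2, whose gradient is w, and the function name and the three learning rates are chosen only for the demonstration.

```python
def gradient_descent(eta, n_iter=10, w=1.0):
    """Minimize J(w) = w**2 / 2 (gradient: w) and record the cost per step."""
    costs = []
    for _ in range(n_iter):
        w -= eta * w               # one gradient step
        costs.append(0.5 * w ** 2)
    return costs


too_large = gradient_descent(eta=2.5)      # overshoots the minimum, cost grows
too_small = gradient_descent(eta=0.0001)   # barely moves in 10 steps
reasonable = gradient_descent(eta=0.5)     # shrinks quickly towards 0
print(too_large[-1], too_small[-1], reasonable[-1])
```

With the large rate each step multiplies w by -1.5, so the iterate jumps across the minimum and lands farther away each time, which mirrors the left panel. Standardizing the features, as done in the next section, makes a single learning rate work well across all weights.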


Standardizing features and re-training adaline 





# standardize features
X_std = np.copy(X)
X_std[:,0] = (X[:,0] - X[:,0].mean()) / X[:,0].std()
X_std[:,1] = (X[:,1] - X[:,1].mean()) / X[:,1].std()

ada = AdalineGD(n_iter=15, eta=0.01)
ada.fit(X_std, y)

plot_decision_regions(X_std, y, classifier=ada)
plt.title('Adaline - Gradient Descent')
plt.xlabel('sepal length [standardized]')
plt.ylabel('petal length [standardized]')
plt.legend(loc='upper left')
plt.tight_layout()
# plt.savefig('./adaline_2.png', dpi=300)
plt.show()

plt.plot(range(1, len(ada.cost_) + 1), ada.cost_, marker='o')
plt.xlabel('Epochs')
plt.ylabel('Sum-squared-error')

plt.tight_layout()
# plt.savefig('./adaline_3.png', dpi=300)
[Figure: Adaline - Gradient Descent, decision regions on standardized sepal length and petal length]
The classification works well, and the error ends up close to 0. Note that although every sample is classified correctly, the error is still not exactly 0.


ada.w_  # weights


array([ 1.36557432e-16, -1.26256159e-01,  1.10479201e+00]) 


Large scale machine learning and stochastic 
gradient descent 
Stochastic gradient descent has an advantage over ordinary (batch) gradient descent because each step is cheaper to compute: every update uses only a single randomly chosen sample.

• Batch gradient descent: one update requires a pass over the whole dataset.
• Stochastic gradient descent: one update requires only a single data point.
class AdalineSGD(object):
    """ADAptive LInear NEuron classifier.

    Parameters
    ------------
    eta : float
        Learning rate (between 0.0 and 1.0)
    n_iter : int
        Passes over the training dataset.

    Attributes
    -----------
    w_ : 1d-array
        Weights after fitting.
    cost_ : list
        Average cost in every epoch.
    shuffle : bool (default: True)
        Shuffles training data every epoch if True to prevent cycles.
    random_state : int (default: None)
        Set random state for shuffling and initializing the weights.

    """
    def __init__(self, eta=0.01, n_iter=10, shuffle=True, random_state=None):
        self.eta = eta
        self.n_iter = n_iter
        self.w_initialized = False
        self.shuffle = shuffle
        if random_state:  # allow the specification of a random seed for consistency
            np.random.seed(random_state)

    def fit(self, X, y):
        """ Fit training data.

        Parameters
        ----------
        X : {array-like}, shape = [n_samples, n_features]
            Training vectors, where n_samples is the number of samples and
            n_features is the number of features.
        y : array-like, shape = [n_samples]
            Target values.

        Returns
        -------
        self : object

        """
        self._initialize_weights(X.shape[1])
        self.cost_ = []
        for i in range(self.n_iter):
            if self.shuffle:  # optionally shuffle the data before each epoch
                X, y = self._shuffle(X, y)
            cost = []
            for xi, target in zip(X, y):
                cost.append(self._update_weights(xi, target))
            avg_cost = sum(cost) / len(y)
            self.cost_.append(avg_cost)
        return self

    def _shuffle(self, X, y):
        """Shuffle training data"""
        r = np.random.permutation(len(y))
        return X[r], y[r]

    def _initialize_weights(self, m):
        """Initialize weights to zeros"""
        self.w_ = np.zeros(1 + m)
        self.w_initialized = True

    # stochastic gradient descent
    def _update_weights(self, xi, target):
        """Apply Adaline learning rule to update the weights"""
        output = self.net_input(xi)
        error = (target - output)
        self.w_[1:] += self.eta * xi.dot(error)  # uses the error of a single sample
        self.w_[0] += self.eta * error  # a single error, no sum over samples
        cost = 0.5 * error**2
        return cost

    def net_input(self, X):
        """Calculate net input"""
        return np.dot(X, self.w_[1:]) + self.w_[0]

    def activation(self, X):
        """Compute linear activation"""
        return self.net_input(X)

    def predict(self, X):
        """Return class label after unit step"""
        return np.where(self.activation(X) >= 0.0, 1, -1)
# plot result
ada = AdalineSGD(n_iter=15, eta=0.01, random_state=1)
ada.fit(X_std, y)

plot_decision_regions(X_std, y, classifier=ada)
plt.title('Adaline - Stochastic Gradient Descent')
plt.xlabel('sepal length [standardized]')
plt.ylabel('petal length [standardized]')
plt.legend(loc='upper left')

plt.tight_layout()
# plt.savefig('./adaline_4.png', dpi=300)
plt.show()

plt.plot(range(1, len(ada.cost_) + 1), ada.cost_, marker='o')
plt.xlabel('Epochs')
plt.ylabel('Average Cost')

plt.tight_layout()
# plt.savefig('./adaline_5.png', dpi=300)
[Figure: average cost per epoch]


The comparison shows that stochastic gradient descent converges much faster than batch gradient descent.


Implementing logistic regression in 
Python 


Logistic regression: a powerful algorithm for linear and binary classification problems.


Odds ratio: the odds in favor of a particular event, $\frac{p}{1 - p}$, where $p$ stands for the probability of the positive event we want to predict. The logit function is the logarithm of the odds:


$$\mathrm{logit}(p) = \log\frac{p}{1 - p}$$


$p(y = 1 \mid x)$ is the conditional probability that a particular sample belongs to class 1 given its features $x$. Therefore, the logit is


$$\mathrm{logit}(p(y = 1 \mid x)) = w_0 x_0 + w_1 x_1 + \dots + w_m x_m = \sum_{i=0}^{m} w_i x_i = w^T x$$


We want to know the probability, which is the inverse of the logit function. We call it the logistic function, or sigmoid function:


$$\phi(z) = \frac{1}{1 + e^{-z}}$$


The output of the sigmoid function is interpreted as the probability that a particular sample belongs to class 1.
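The inverse relationship between the logit and the sigmoid can be checked numerically. This is a small dependency-free sketch; the function names and the probe value 0.3 are chosen only for illustration.

```python
import math


def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))


def sigmoid(z):
    """Logistic function: maps any real z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


p = 0.3
z = logit(p)               # negative, because p < 0.5
p_back = sigmoid(z)        # recovers p up to rounding error
print(z, p_back, sigmoid(0.0))
```

A net input of 0 corresponds to a probability of exactly 0.5, which is why thresholding the net input at 0 is equivalent to thresholding the probability at 0.5.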


Plot sigmoid function: 


%matplotlib inline 
import matplotlib.pyplot as plt 
import numpy as np 


def sigmoid(z): 
return 1.0 / (1.0 + np.exp(-z)) 


z = np.arange(-7, 7, 0.1)
phi_z = sigmoid(z)

plt.plot(z, phi_z)
plt.axvline(0.0, color='k')
plt.ylim(-0.1, 1.1)
plt.xlabel('z')
plt.ylabel('$\phi (z)$')

plt.yticks([0.0, 0.5, 1.0])
ax = plt.gca()
ax.yaxis.grid(True)

plt.tight_layout()





$\phi(z)$ approaches 1 as $z \to \infty$ and approaches 0 as $z \to -\infty$.


Plot cost function: 


Use the log-likelihood function to redefine the cost function:


$$J(\phi(z), y; w) = \begin{cases} -\log(\phi(z)) & \text{if } y = 1 \\ -\log(1 - \phi(z)) & \text{if } y = 0 \end{cases}$$


then


def cost_1(z):
    return - np.log(sigmoid(z))


def cost_0(z):
    return - np.log(1 - sigmoid(z))


z = np.arange(-10, 10, 0.1)
phi_z = sigmoid(z)

c1 = [cost_1(x) for x in z]
plt.plot(phi_z, c1, label='J(w) if y=1')

c0 = [cost_0(x) for x in z]
plt.plot(phi_z, c0, linestyle='--', label='J(w) if y=0')

plt.ylim(0.0, 5.1)
plt.xlim([0, 1])
plt.xlabel('$\phi$(z)')
plt.ylabel('J(w)')
plt.legend(loc='best')
plt.tight_layout()
# plt.savefig('./figures/log_cost.png', dpi=300)

# this illustrates the cost for the classification of a single-sample
# instance for different values of phi(z)
The cost approaches 0 when the class is predicted correctly.
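A few numbers make the shape of this cost concrete. The sketch below is dependency-free and the helper name `logistic_cost` and the probe values are assumptions for illustration: it evaluates the single-sample cost for a confident prediction that is right, the same prediction when it is wrong, and an undecided one.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def logistic_cost(z, y):
    """Single-sample cost: -log(phi(z)) if y == 1, else -log(1 - phi(z))."""
    phi = sigmoid(z)
    return -math.log(phi) if y == 1 else -math.log(1.0 - phi)


confident_right = logistic_cost(4.0, 1)   # phi(z) is close to 1 and y = 1
confident_wrong = logistic_cost(4.0, 0)   # same output, but the true label is 0
undecided = logistic_cost(0.0, 1)         # phi(z) = 0.5
print(confident_right, confident_wrong, undecided)
```

The penalty for a confident mistake grows without bound as the predicted probability moves towards the wrong extreme, which is what pushes the weights away from overconfident wrong answers.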


Implement in Python 


The following implementation is similar to the Adaline implementation except that we replace the sum of squared errors cost function with the logistic cost function


$$J(w) = -\sum_i \left[ y^{(i)} \log\left(\phi(z^{(i)})\right) + \left(1 - y^{(i)}\right) \log\left(1 - \phi(z^{(i)})\right) \right]$$


class LogisticRegression(object):
    """LogisticRegression classifier.

    Parameters
    ------------
    eta : float
        Learning rate (between 0.0 and 1.0)
    n_iter : int
        Passes over the training dataset.

    Attributes
    -----------
    w_ : 1d-array
        Weights after fitting.
    cost_ : list
        Cost in every epoch.

    """
    def __init__(self, eta=0.01, n_iter=50):
        self.eta = eta
        self.n_iter = n_iter

    def fit(self, X, y):
        """ Fit training data.

        Parameters
        ----------
        X : {array-like}, shape = [n_samples, n_features]
            Training vectors, where n_samples is the number of samples and
            n_features is the number of features.
        y : array-like, shape = [n_samples]
            Target values.

        Returns
        -------
        self : object

        """
        self.w_ = np.zeros(1 + X.shape[1])
        self.cost_ = []
        for i in range(self.n_iter):
            y_val = self.activation(X)
            errors = (y - y_val)
            neg_grad = X.T.dot(errors)
            self.w_[1:] += self.eta * neg_grad
            self.w_[0] += self.eta * errors.sum()
            self.cost_.append(self._logit_cost(y, self.activation(X)))
        return self

    def _logit_cost(self, y, y_val):
        logit = -y.dot(np.log(y_val)) - ((1 - y).dot(np.log(1 - y_val)))
        return logit

    def _sigmoid(self, z):
        return 1.0 / (1.0 + np.exp(-z))

    def net_input(self, X):
        """Calculate net input"""
        return np.dot(X, self.w_[1:]) + self.w_[0]

    def activation(self, X):
        """ Activate the logistic neuron"""
        z = self.net_input(X)
        return self._sigmoid(z)

    def predict_proba(self, X):
        """
        Predict class probabilities for X.

        Parameters
        ----------
        X : {array-like, sparse matrix}, shape = [n_samples, n_features]
            Training vectors, where n_samples is the number of samples and
            n_features is the number of features.

        Returns
        ----------
        Class 1 probability : float

        """
        return self.activation(X)

    def predict(self, X):
        """
        Predict class labels for X.

        Parameters
        ----------
        X : {array-like, sparse matrix}, shape = [n_samples, n_features]
            Training vectors, where n_samples is the number of samples and
            n_features is the number of features.

        Returns
        ----------
        class : int
            Predicted class label.

        """
        # equivalent to np.where(self.activation(X) >= 0.5, 1, 0)
        return np.where(self.net_input(X) >= 0.0, 1, 0)


array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,


lr = LogisticRegression(n_iter=500, eta=0.02).fit(X_std, y)
plt.plot(range(1, len(lr.cost_) + 1), np.log10(lr.cost_))
plt.xlabel('Epochs')
plt.ylabel('Cost')
plt.title('Logistic Regression - Learning rate 0.02')

plt.tight_layout()


[Figure: Logistic Regression - Learning rate 0.02, cost per epoch]
plot_decision_regions(X_std, y, classifier=lr)
plt.title('Logistic Regression - Gradient Descent')
plt.xlabel('sepal length [standardized]')
plt.ylabel('petal length [standardized]')
plt.legend(loc='upper left')
plt.tight_layout()


[Figure: Logistic Regression - Gradient Descent, decision regions on standardized sepal length and petal length]


Classification with scikit-learn 




Loading and preprocessing the data 




Loading the Iris dataset from scikit-learn. Here, the third column represents the petal length, and the fourth column the petal width of the flower samples. The classes are already converted to integer labels where 0=Iris-Setosa, 1=Iris-Versicolor, 2=Iris-Virginica.
from sklearn import datasets
import numpy as np

iris = datasets.load_iris()
X = iris.data[:, [2, 3]]
y = iris.target
print('Class labels:', np.unique(y))
print(iris.target_names)


('Class labels:', array([0, 1, 2]))
['setosa' 'versicolor' 'virginica']


Splitting data into 70% training and 30% test data: 


from sklearn.cross_validation import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)


Standardizing the features:


from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
sc.fit(X_train)

X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)
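The important detail in the standardization step is that the mean and standard deviation are estimated on the training data only and then reused for the test data. A minimal sketch of that fit/transform split follows; the helper names `fit_scaler` and `transform` and the toy values are assumptions for illustration, and the population standard deviation is used, as `StandardScaler` does.

```python
def fit_scaler(column):
    """Estimate mean and population standard deviation from training values."""
    n = len(column)
    mean = sum(column) / n
    std = (sum((v - mean) ** 2 for v in column) / n) ** 0.5
    return mean, std


def transform(column, mean, std):
    """Apply previously estimated statistics to any data split."""
    return [(v - mean) / std for v in column]


train = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical training feature
test = [3.0, 6.0]                   # hypothetical test feature

mean, std = fit_scaler(train)       # statistics come from the training set only
train_std = transform(train, mean, std)
test_std = transform(test, mean, std)
print(train_std, test_std)
```

Reusing the training statistics keeps the two splits on the same scale and avoids leaking information about the test set into preprocessing.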


Other Available Data 
Scikit-learn makes available a host of datasets for testing learning algorithms. They come in three flavors:


• Packaged Data: these small datasets are packaged with the scikit-learn installation, and can be loaded using the tools in sklearn.datasets.load_*
• Downloadable Data: these larger datasets are available for download, and scikit-learn includes tools which streamline this process. These tools can be found in sklearn.datasets.fetch_*
• Generated Data: there are several datasets which are generated from models based on a random seed. These are available in sklearn.datasets.make_*


You can explore the available dataset loaders, fetchers, and generators using IPython's tab-completion functionality. After importing the datasets submodule from sklearn, type


datasets.load_<TAB>


or


datasets.fetch_<TAB>


or


datasets.make_<TAB>


to see a list of available functions.


The data downloaded using the fetch scripts are stored locally, within a subdirectory of your home directory. You can use the following to determine where it is:


from sklearn.datasets import get_data_home
get_data_home()


Training a perceptron via scikit-learn 
from sklearn.linear_model import Perceptron

ppn = Perceptron(n_iter=40, eta0=0.1, random_state=0)
ppn.fit(X_train_std, y_train)


Perceptron(alpha=0.0001, class_weight=None, eta0=0.1, fit_intercept=True,
      n_iter=40, n_jobs=1, penalty=None, random_state=0, shuffle=True,
      verbose=0, warm_start=False)


y_test.shape
(45,)


y_pred = ppn.predict(X_test_std)
print('Misclassified samples: %d' % (y_test != y_pred).sum())


Misclassified samples: 4 


from sklearn.metrics import accuracy_score

print('Accuracy: %.2f' % accuracy_score(y_test, y_pred))


Accuracy: 0.91 
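The accuracy follows directly from the misclassification count: 4 errors out of 45 test samples leave 41 correct predictions. A minimal sketch with placeholder labels (the label vectors below are invented so that exactly 4 of 45 disagree; they are not the real Iris predictions):

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# placeholder labels: 45 test samples, 4 of them predicted wrongly
y_true = [0] * 45
y_pred = [0] * 41 + [1] * 4

acc = accuracy(y_true, y_pred)   # 41 / 45
print('Accuracy: %.2f' % acc)
```

Accuracy and misclassification rate always sum to 1, so either number carries the same information.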


from matplotlib.colors import ListedColormap 
import matplotlib.pyplot as plt 
%matplotlib inline 


# redefine the decision-region plotting function so that it can
# distinguish training data from test data
def plot_decision_regions(X, y, classifier, test_idx=None, resolution=0.02):

    # setup marker generator and color map
    markers = ('s', 'x', 'o', '^', 'v')
    colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
    cmap = ListedColormap(colors[:len(np.unique(y))])

    # plot the decision surface
    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution),
                           np.arange(x2_min, x2_max, resolution))
    Z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
    Z = Z.reshape(xx1.shape)
    plt.contourf(xx1, xx2, Z, alpha=0.4, cmap=cmap)
    plt.xlim(xx1.min(), xx1.max())
    plt.ylim(xx2.min(), xx2.max())

    # plot all samples
    for idx, cl in enumerate(np.unique(y)):
        plt.scatter(x=X[y == cl, 0], y=X[y == cl, 1],
                    alpha=0.8, c=cmap(idx),
                    marker=markers[idx], label=cl)

    # highlight test samples
    if test_idx:
        X_test, y_test = X[test_idx, :], y[test_idx]
        plt.scatter(X_test[:, 0], X_test[:, 1], c='',
                    alpha=1.0, linewidth=1, marker='o',
                    s=55, label='test set')


Training a perceptron model using the standardized training data: 


X_combined_std = np.vstack((X_train_std, X_test_std))
y_combined = np.hstack((y_train, y_test))

plot_decision_regions(X=X_combined_std, y=y_combined,
                      classifier=ppn, test_idx=range(105, 150))
plt.xlabel('petal length [standardized]')
plt.ylabel('petal width [standardized]')
plt.legend(loc='upper left')

plt.tight_layout()
# plt.savefig('./figures/iris_perceptron_scikit.png', dpi=300)


[Figure: perceptron decision regions on standardized petal length and petal width]


The perceptron cannot converge on a dataset that is not perfectly linearly separable, so it is rarely used in practice.


Modeling class probabilities via logistic regression


# use Logistic Regression
from sklearn.linear_model import LogisticRegression
# C is the inverse of the regularization strength (see "Regularization path")
lr = LogisticRegression(C=1000.0, random_state=0)
lr.fit(X_train_std, y_train)


plot_decision_regions(X_combined_std, y_combined,
                      classifier=lr, test_idx=range(105, 150))
plt.xlabel('petal length [standardized]')
plt.ylabel('petal width [standardized]')
plt.legend(loc='upper left')
plt.tight_layout()
# plt.savefig('./figures/logistic_regression.png', dpi=300)


[Figure: logistic regression decision regions; x-axis: petal length [standardized], y-axis: petal width [standardized]]


lr.predict_proba(X_test_std[0, :].reshape(1, -1))  # predict class probabilities


array([[ 2.05743774e-11, 6.31620264e-02, 9.36837974e-01]]) 


Regularization path 


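Addressing overfitting: an overfit model fits the training data so closely that it loses generality, so its predictions on new samples are poor.


An overfit model typically has high variance.


The most common remedy is L2 regularization, which adds a penalty on the size of the weights to the cost function:

$$\frac{\lambda}{2}\lVert w \rVert^2 = \frac{\lambda}{2}\sum_{j=1}^{m} w_j^2$$


Here $\lambda$ is the regularization parameter, which controls how closely the model fits the training data, and $C = \frac{1}{\lambda}$ is the parameter C mentioned earlier.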
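
As a rough numerical sketch of the L2 penalty and of the inverse relation between C and the regularization parameter, the snippet below evaluates the penalty for a fixed weight vector at several values of C. The helper names `lambda_from_C` and `l2_penalty` and the example weights are assumptions made for illustration; they are not part of the original notebook.

```python
def lambda_from_C(C):
    """Regularization parameter lambda, taken as the inverse of C."""
    return 1.0 / C


def l2_penalty(weights, lam):
    """L2 penalty (lambda / 2) * sum_j w_j^2 for a list of weights.

    The bias term is assumed to be left out of `weights`.
    """
    return 0.5 * lam * sum(w * w for w in weights)


example_weights = [3.0, -4.0]  # hypothetical weights, sum of squares = 25
for C in (1000.0, 1.0, 0.01):
    lam = lambda_from_C(C)
    print('C=%g  lambda=%g  penalty=%g' % (C, lam, l2_penalty(example_weights, lam)))
```

For the same weights, a smaller C gives a larger penalty, which is why decreasing C pushes the fitted weights toward zero.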


weights, params = [], []
for c in np.arange(-5, 5):
    # use a float base: NumPy rejects integers raised to negative integer powers
    lr = LogisticRegression(C=10.0**c, random_state=0)
    lr.fit(X_train_std, y_train)
    weights.append(lr.coef_[1])
    params.append(10.0**c)

weights = np.array(weights)
plt.plot(params, weights[:, 0],
         label='petal length')
plt.plot(params, weights[:, 1], linestyle='--',
         label='petal width')
plt.ylabel('weight coefficient')
plt.xlabel('C')
plt.legend(loc='upper left')
plt.xscale('log')
# plt.savefig('./figures/regression_path.png', dpi=300)


Decreasing C increases the regularization strength.


[Figure: weight coefficients for petal length and petal width plotted against C on a log scale]


Logistic regression with regularization


def sigmoid(z):
    """Logistic sigmoid function."""
    return 1.0 / (1.0 + np.exp(-z))


class LogitGD(object):
    """Logistic Regression classifier with L2 regularization,
    trained by batch gradient descent.

    Parameters
    ------------
    eta : float
        Learning rate (between 0.0 and 1.0)
    lamb : float
        L2 regularization strength.
    n_iter : int
        Passes over the training dataset.

    Attributes
    -----------
    w_ : 1d-array
        Weights after fitting.
    cost_ : list
        Regularized cost in every epoch.

    """
    def __init__(self, eta=0.01, lamb=0.01, n_iter=50):
        self.eta = eta
        self.n_iter = n_iter
        self.lamb = lamb

    def fit(self, X, y):
        """ Fit training data.

        Parameters
        ----------
        X : {array-like}, shape = [n_samples, n_features]
            Training vectors, where n_samples is the number of samples and
            n_features is the number of features.
        y : array-like, shape = [n_samples]
            Target values, encoded as 0 and 1.

        Returns
        -------
        self : object

        """
        self.w_ = np.zeros(1 + X.shape[1])
        self.cost_ = []

        for i in range(self.n_iter):
            # use the sigmoid output (not the raw net input) for the update
            output = self.activation(X)
            errors = (y - output)
            # the L2 penalty is applied to the weights only, not to the bias
            self.w_[1:] += self.eta * (X.T.dot(errors) - self.lamb * self.w_[1:])
            self.w_[0] += self.eta * errors.sum()
            # negative log-likelihood plus the L2 penalty
            cost = (-y.dot(np.log(output)) - (1 - y).dot(np.log(1 - output))
                    + self.lamb / 2.0 * np.sum(self.w_[1:]**2))
            self.cost_.append(cost)
        return self

    def net_input(self, X):
        """Calculate net input"""
        return np.dot(X, self.w_[1:]) + self.w_[0]

    def activation(self, X):
        """Compute sigmoid activation"""
        return sigmoid(self.net_input(X))

    def predict(self, X):
        """Return class label (1 or 0) after thresholding the probability at 0.5"""
        return np.where(self.activation(X) >= 0.5, 1, 0)


Introduction to other classification algorithms


Maximum margin classification with 
support vector machines 


The objective is to maximize the margin, defined as the distance between the separating decision boundary and the training samples closest to it.


[Figure: SVM maximum-margin illustration showing the support vectors, the decision boundary $w^T x = 0$, the "positive" hyperplane $w^T x = 1$ and the "negative" hyperplane $w^T x = -1$; the SVM picks the hyperplane that maximizes the margin]
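
To make the margin concrete, here is a small sketch that computes the distance between the hyperplanes $w^T x = 1$ and $w^T x = -1$, which equals $2 / \lVert w \rVert$. The function names and the weight vector are hypothetical, and the bias term is assumed to be absorbed into (or omitted from) $w$.

```python
import math


def margin_width(w):
    """Distance between the hyperplanes w.x = +1 and w.x = -1, i.e. 2 / ||w||."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return 2.0 / norm


def distance_to_boundary(w, x):
    """Distance from the point x to the decision boundary w.x = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(sum(wi * xi for wi, xi in zip(w, x))) / norm


w = [3.0, 4.0]                      # hypothetical weight vector, ||w|| = 5
print(margin_width(w))              # full margin width
print(distance_to_boundary(w, [0.12, 0.16]))  # a point on w.x = 1: half the margin
```

Maximizing the margin is therefore equivalent to minimizing $\lVert w \rVert$ subject to the samples being classified correctly.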


from sklearn.svm import SVC


svm = SVC(kernel='linear', C=1.0, random_state=0)
svm.fit(X_train_std, y_train)


plot_decision_regions(X_combined_std, y_combined,
                      classifier=svm, test_idx=range(105, 150))
plt.xlabel('petal length [standardized]')
plt.ylabel('petal width [standardized]')
plt.legend(loc='upper left')
plt.tight_layout()


[Figure: linear SVM decision regions; x-axis: petal length [standardized], y-axis: petal width [standardized]]


Solving non-linear problems using a kernel SVM


Kernel SVMs can also solve problems that are not linearly separable.


# create a simple dataset
np.random.seed(0)
X_xor = np.random.randn(200, 2)
# XOR of the two sign tests: roughly half the samples get label 1
# and half get label -1 (after the np.where below)
y_xor = np.logical_xor(X_xor[:, 0] > 0, X_xor[:, 1] > 0)
y_xor = np.where(y_xor, 1, -1)

plt.scatter(X_xor[y_xor==1, 0], X_xor[y_xor==1, 1], c='b', marker='x', label='1')
plt.scatter(X_xor[y_xor==-1, 0], X_xor[y_xor==-1, 1], c='r', marker='s', label='-1')

plt.xlim([-3, 3])
plt.ylim([-3, 3])
plt.legend(loc='best')
plt.tight_layout()
# plt.savefig('./figures/xor.png', dpi=300)


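A plain linear logistic regression cannot separate these samples into the positive and negative classes well.


rbf refers to the radial basis function kernel, also called the Gaussian kernel:

$$k(x^{(i)}, x^{(j)}) = \exp\left(-\frac{\lVert x^{(i)} - x^{(j)} \rVert^2}{2\sigma^2}\right)$$

which is often simplified to

$$k(x^{(i)}, x^{(j)}) = \exp\left(-\gamma \lVert x^{(i)} - x^{(j)} \rVert^2\right) \quad \text{with} \quad \gamma = \frac{1}{2\sigma^2}$$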
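
The RBF (Gaussian) kernel $\exp(-\gamma \lVert x^{(i)} - x^{(j)} \rVert^2)$ can be evaluated by hand, which helps when reasoning about the effect of $\gamma$. The following is a minimal sketch; `rbf_kernel`, `gamma_from_sigma` and the sample points are hypothetical names and values chosen for illustration.

```python
import math


def rbf_kernel(x_i, x_j, gamma):
    """exp(-gamma * ||x_i - x_j||^2) for two equal-length sequences."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-gamma * sq_dist)


def gamma_from_sigma(sigma):
    """gamma = 1 / (2 * sigma^2), linking the two forms of the kernel."""
    return 1.0 / (2.0 * sigma ** 2)


a, b = [0.0, 0.0], [1.0, 1.0]   # squared distance = 2
for gamma in (0.1, 1.0, 100.0):
    print(gamma, rbf_kernel(a, b, gamma))
```

The kernel value acts as a similarity score between 0 and 1: it is 1 for identical points and decays with distance, and the decay is faster for larger $\gamma$.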


# Use the SVM kernel method to project the data into a higher-dimensional
# space in which they become linearly separable
svm = SVC(kernel='rbf', random_state=0, gamma=0.10, C=10.0)
svm.fit(X_xor, y_xor)
plot_decision_regions(X_xor, y_xor,
                      classifier=svm)
plt.legend(loc='upper left')
plt.tight_layout()
# plt.savefig('./figures/support_vector_machine_rbf_xor.png', dpi=300)


[Figure: decision regions of the RBF-kernel SVM on the XOR dataset]


The γ parameter can be understood as a cut-off parameter for the Gaussian sphere.


Increasing γ increases the influence of the individual training samples, which makes the decision boundary tighter and bumpier; a small γ gives a softer boundary.


from sklearn.svm import SVC
# relatively small gamma
svm = SVC(kernel='rbf', random_state=0, gamma=0.2, C=1.0)
svm.fit(X_train_std, y_train)


plot_decision_regions(X_combined_std, y_combined,
                      classifier=svm, test_idx=range(105, 150))
plt.xlabel('petal length [standardized]')
plt.ylabel('petal width [standardized]')
plt.legend(loc='upper left')
plt.tight_layout()
# plt.savefig('./figures/support_vector_machine_rbf_iris_1.png', dpi=300)


# large gamma: tight decision boundary
svm = SVC(kernel='rbf', random_state=0, gamma=100.0, C=1.0)
svm.fit(X_train_std, y_train)


plot_decision_regions(X_combined_std, y_combined,
                      classifier=svm, test_idx=range(105, 150))
plt.xlabel('petal length [standardized]')
plt.ylabel('petal width [standardized]')
plt.legend(loc='upper left')
plt.tight_layout()
# plt.savefig('./figures/support_vector_machine_rbf_iris_2.png', dpi=300)


K-nearest neighbors - a lazy learning algorithm

It doesn't learn a discriminative function from the training data but memorizes the
training dataset instead.


1. Choose the number of k and a distance metric. 
2. Find the k nearest neighbors of the sample that we want to classify. 
3. Assign the class label by majority vote. 
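
The three steps above can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions (Euclidean distance, ties resolved by `Counter` ordering); the function `knn_predict` and the toy data are hypothetical and are not the scikit-learn implementation.

```python
from collections import Counter
import math


def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training samples."""
    # Step 1: the distance metric chosen here is plain Euclidean distance
    distances = [
        (math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_new))), label)
        for x, label in zip(X_train, y_train)
    ]
    # Step 2: keep the k closest samples
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 3: majority vote over their labels
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


X_toy = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]]
y_toy = [0, 0, 0, 1, 1, 1]
print(knn_predict(X_toy, y_toy, [0.15, 0.15], k=3))
print(knn_predict(X_toy, y_toy, [0.95, 1.0], k=3))
```

Note that there is no training step at all: every prediction scans the stored samples, which is where the "lazy" label comes from.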


The advantage of this approach is that the classifier adapts immediately as new data arrive, but the computational cost of classifying a sample grows linearly with the number of training samples, and storage is also a concern.


from sklearn.neighbors import KNeighborsClassifier
# p=2 with the 'minkowski' metric is the Euclidean distance
knn = KNeighborsClassifier(n_neighbors=5, p=2, metric='minkowski')
knn.fit(X_train_std, y_train)


plot_decision_regions(X_combined_std, y_combined,
                      classifier=knn, test_idx=range(105, 150))
plt.xlabel('petal length [standardized]')
plt.ylabel('petal width [standardized]')
plt.legend(loc='upper left')
plt.tight_layout()
# plt.savefig('./figures/k_nearest_neighbors.png', dpi=300)


[Figure: k-nearest neighbors decision regions; x-axis: petal length [standardized], y-axis: petal width [standardized]]


Choosing k is a key decision, and the data need to be standardized. The 'minkowski' distance used in the example is a generalization of the ordinary Euclidean and Manhattan distances.
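
As a quick illustration of how the Minkowski distance generalizes the other two, the sketch below implements it directly; `minkowski_distance` is a hypothetical helper written for this note, with p=1 giving the Manhattan distance and p=2 the Euclidean distance.

```python
def minkowski_distance(a, b, p=2):
    """Minkowski distance (sum_k |a_k - b_k|^p)^(1/p) between two sequences."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


a, b = [0.0, 0.0], [3.0, 4.0]
print(minkowski_distance(a, b, p=1))  # Manhattan: 7.0
print(minkowski_distance(a, b, p=2))  # Euclidean: 5.0
```

Because the distance sums raw coordinate differences, a feature with a larger scale dominates the result, which is why standardization matters for KNN.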
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Scoring metrics for classification 
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Classification metrics in Scikit-learn 
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The sklearn.metrics module implements several loss, score, and utility 
functions to measure classification performance. Some metrics might require 
probability estimates of the positive class, confidence values, or binary decisions 
values. Most implementations allow each sample to provide a weighted
contribution to the overall score, through the sample_weight parameter.


Some of these are restricted to the binary classification case:

matthews_corrcoef(y_true, y_pred)                   Compute the Matthews correlation coefficient (MCC) for binary classes
precision_recall_curve(y_true, probas_pred)         Compute precision-recall pairs for different probability thresholds
roc_curve(y_true, y_score[, pos_label, ...])        Compute Receiver operating characteristic (ROC)


Others also work in the multiclass case:

confusion_matrix(y_true, y_pred[, labels])          Compute confusion matrix to evaluate the accuracy of a classification
hinge_loss(y_true, pred_decision[, labels, ...])    Average hinge loss (non-regularized)


Some also work in the multilabel case:

accuracy_score(y_true, y_pred[, normalize, ...])    Accuracy classification score.
classification_report(y_true, y_pred[, ...])        Build a text report showing the main classification metrics
f1_score(y_true, y_pred[, labels, ...])             Compute the F1 score, also known as balanced F-score or F-measure
fbeta_score(y_true, y_pred, beta[, labels, ...])    Compute the F-beta score
hamming_loss(y_true, y_pred[, classes])             Compute the average Hamming loss.
jaccard_similarity_score(y_true, y_pred[, ...])     Jaccard similarity coefficient score
log_loss(y_true, y_pred[, eps, normalize, ...])     Log loss, aka logistic loss or cross-entropy loss.
precision_recall_fscore_support(y_true, y_pred)     Compute precision, recall, F-measure and support for each class
precision_score(y_true, y_pred[, labels, ...])      Compute the precision
recall_score(y_true, y_pred[, labels, ...])         Compute the recall
zero_one_loss(y_true, y_pred[, normalize, ...])     Zero-one classification loss.


And some work with binary and multilabel (but not multiclass) problems:

average_precision_score(y_true, y_score[, ...])     Compute average precision (AP) from prediction scores
roc_auc_score(y_true, y_score[, average, ...])      Compute Area Under the Curve (AUC) from prediction scores


from sklearn import datasets
from sklearn.cross_validation import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


X, y = datasets.make_classification(n_classes=2, random_state=0)


X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)


sc = StandardScaler()
sc.fit(X_train)
X_train_std = sc.transform(X_train)  # standardize by mean
X_test_std = sc.transform(X_test)


model = SVC(probability=True, random_state=0)
model.fit(X_train_std, y_train);


The default score for classification in sklearn is accuracy (the fraction of labels predicted correctly):

$$\mathrm{accuracy}(y, \hat{y}) = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples}-1} 1(\hat{y}_i = y_i)$$

where $1(x)$ is the indicator function.


model.score(X_test_std, y_test)
0.83333333333333337


from sklearn.metrics import accuracy_score
y_pred = model.predict(X_test_std)
accuracy_score(y_test, y_pred)


0.83333333333333337


Reading a confusion matrix

                          Predicted class
                          P                       N
Actual class      P       True Positives (TP)     False Negatives (FN)
                  N       False Positives (FP)    True Negatives (TN)














For multi-class problems, it is often interesting to know which of the classes are 
hard to predict, and which are easy, or which classes get confused. One way to 
get more information about misclassifications is the confusion matrix , which 
shows for each true class, how frequent a given predicted outcome is. 


from sklearn.metrics import confusion_matrix
y_test_pred = model.predict(X_test_std)
confmat = confusion_matrix(y_test, y_test_pred)


print(confmat)


[[15 3] 
[ 2 10]] 
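
To connect the printed matrix to the TP/FN/FP/TN layout, the sketch below unpacks a matrix with the same values. It assumes scikit-learn's convention that rows are true classes and columns are predicted classes (both in the order 0, 1), and it treats class 1 as the positive class; the hard-coded matrix is a copy of the output above used for illustration.

```python
confmat = [[15, 3],
           [2, 10]]   # rows: true class 0, 1; columns: predicted class 0, 1

# With class 1 as "positive", the first row holds TN and FP,
# and the second row holds FN and TP.
(tn, fp), (fn, tp) = confmat
total = tn + fp + fn + tp
accuracy = (tp + tn) / float(total)

print('TP=%d FN=%d FP=%d TN=%d' % (tp, fn, fp, tn))
print('Accuracy: %.4f' % accuracy)
```

The accuracy recovered this way, 25 correct out of 30, agrees with the score of about 0.833 reported earlier.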


plt.matshow(confusion_matrix(y_test, y_test_pred), cmap=plt.cm.Blues)
plt.colorbar()
plt.xlabel("Predicted label")
plt.ylabel("True label");


Precision, recall and F-measures

• Precision is how many of the predictions for a class are actually that class.
• Recall is how many of the actual positives were recovered.
• The f1-score is the harmonic mean of precision and recall.


With TP, FP, TN, FN, FPR, TPR standing for "true positive", "false positive", "true
negative", "false negative", "false positive rate" and "true positive rate" respectively:


\begin{align}
&PRE = \frac{TP}{TP+FP} \\
&REC = TPR = \frac{TP}{FN+TP} \\
&F1 = 2 \frac{PRE \times REC}{PRE+REC} \\
&F_\beta = (1+\beta^2)\frac{PRE \times REC}{\beta^2 PRE+REC} \\
&FPR = \frac{FP}{FP+TN} \\
&TPR = \frac{TP}{FN+TP}
\end{align}
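
As a worked check of these formulas, the sketch below evaluates them from raw counts. The counts TP=10, FP=3, FN=2 are taken from the confusion matrix [[15, 3], [2, 10]] with class 1 as the positive class; the helper names are hypothetical.

```python
def precision(tp, fp):
    """PRE = TP / (TP + FP)"""
    return tp / float(tp + fp)


def recall(tp, fn):
    """REC = TP / (FN + TP)"""
    return tp / float(tp + fn)


def fbeta(pre, rec, beta=1.0):
    """F_beta = (1 + beta^2) * PRE * REC / (beta^2 * PRE + REC); beta=1 gives F1."""
    b2 = beta ** 2
    return (1 + b2) * pre * rec / (b2 * pre + rec)


tp, fp, fn = 10, 3, 2   # counts with class 1 as the positive class
pre, rec = precision(tp, fp), recall(tp, fn)
print('Precision: %.3f' % pre)
print('Recall: %.3f' % rec)
print('F1: %.3f' % fbeta(pre, rec))
print('F_beta2: %.3f' % fbeta(pre, rec, beta=2))
```

With beta=2 the score weights recall more heavily than precision, so it lands closer to the recall value than F1 does.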


from sklearn.metrics import precision_score, recall_score, f1_score, fbeta_score


print('Precision: %.3f' % precision_score(y_true=y_test, y_pred=y_test_pred))
print('Recall: %.3f' % recall_score(y_true=y_test, y_pred=y_test_pred))
print('F1: %.3f' % f1_score(y_true=y_test, y_pred=y_test_pred))
print('F_beta2: %.3f' % fbeta_score(y_true=y_test, y_pred=y_test_pred, beta=2))


Precision: 0.769 
Recall: 0.833 
F1: 0.800 
F_beta2: 0.820 


Another useful function is the classification_report which provides precision, recall, 
fscore and support for all classes. 


from sklearn.metrics import classification_report 
print(classification_report(y_test, y_test_pred)) 


precision recall f1-score support 

0 0.88 0.83 0.86 18 

1 0.77 0.83 0.80 12 

avg / total 0.84 0.83 0.83 30 


These metrics are helpful in two particular cases that come up often in practice: 


1. Imbalanced classes, that is one class might be much more frequent than the 
other. 

2. Asymmetric costs, that is one kind of error is much more "costly" than the 
other. 


ROC and AUC 
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A receiver operating characteristic curve, or ROC curve, is a graphical plot which 
illustrates the performance of a binary classifier system as its discrimination 
threshold is varied. It is created by plotting the fraction of true positives out of the 
positives (TPR = true positive rate) vs. the fraction of false positives out of the 
negatives (FPR = false positive rate), at various threshold settings. TPR is also 
known as sensitivity, and FPR is one minus the specificity or true negative rate. 


If the classifier performs well, the curve lies close to the top-left corner of the plot.


From the ROC curve we can compute the AUC, the area under the curve.


Area Under Curve 


The AUC is a common evaluation metric for binary classification problems. 
Consider a plot of the true positive rate vs the false positive rate as the threshold 
value for classifying an item as 0 or 1 is increased from 0 to 1: if the classifier is very
good, the true positive rate will increase quickly and the area under the curve will 
be close to 1. If the classifier is no better than random guessing, the true positive 
rate will increase linearly with the false positive rate and the area under the curve 
will be around 0.5. 


One characteristic of the AUC is that it is independent of the fraction of the test 
population which is class 0 or class 1: this makes the AUC useful for evaluating 
the performance of classifiers on unbalanced data sets. 
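
Given the points of a ROC curve, the area under it can be approximated with the trapezoidal rule. The following is a minimal sketch under that assumption; `auc_trapezoid` is a hypothetical helper, not scikit-learn's `roc_auc_score`.

```python
def auc_trapezoid(fpr, tpr):
    """Area under a ROC curve given matching lists of FPR and TPR values."""
    # sort the points by FPR (then TPR) so the curve is traversed left to right
    points = sorted(zip(fpr, tpr))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points[:-1], points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


print(auc_trapezoid([0.0, 1.0], [0.0, 1.0]))            # diagonal (random guessing): 0.5
print(auc_trapezoid([0.0, 0.0, 1.0], [0.0, 1.0, 1.0]))  # perfect classifier: 1.0
```

The two toy curves reproduce the limiting cases described above: a diagonal ROC curve has an area of 0.5, and a curve that jumps straight to the top-left corner has an area of 1.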


# pos_class defaults to 1 here; the default is an assumption, because the
# original signature is truncated in the source
def roc_curve(true_labels, predicted_probs, n_points=100, pos_class=1):
    thr = np.linspace(0, 1, n_points)
    tpr = np.zeros(n_points)
    fpr = np.zeros(n_points)

    pos = true_labels == pos_class
    neg = np.logical_not(pos)
    n_pos = np.count_nonzero(pos)
    n_neg = np.count_nonzero(neg)

    for i, t in enumerate(thr):
        tpr[i] = np.count_nonzero(np.logical_and(predicted_probs >= t, pos)) / float(n_pos)
        fpr[i] = np.count_nonzero(np.logical_and(predicted_probs >= t, neg)) / float(n_neg)

    return fpr, tpr, thr


import pandas as pd

df_imputed = pd.read_csv('df_imputed')


features = ['revolving_utilization_of_unsecured_lines',
            'age',
            'number_of_time30-59_days_past_due_not_worse',
            'debt_ratio',
            'monthly_income',
            'number_of_open_credit_lines_and_loans',
            'number_of_times90_days_late',
            'number_real_estate_loans_or_lines',
            'number_of_time60-89_days_past_due_not_worse',
            'number_of_dependents',
            'income_bins',
            'age_bin',
            'monthly_income_scaled']
y = df_imputed.serious_dlqin2yrs
X = pd.get_dummies(df_imputed[features], columns=['income_bins', 'age_bin'])


from sklearn.cross_validation import train_test_split
train_X, test_X, train_y, test_y = train_test_split(X, y, train_size=0.7, random_state=1)


# Randomly generated predictions should give us a diagonal ROC curve
preds = np.random.rand(len(test_y))
fpr, tpr, thr = roc_curve(test_y, preds)
plt.plot(fpr, tpr);


from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
clf.fit(train_X, train_y)
preds = clf.predict_proba(test_X)[:, 1]
fpr, tpr, thr = roc_curve(test_y, preds)
plt.plot(fpr, tpr);





Log loss 
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Log loss, also called logistic regression loss or cross-entropy loss, is defined on 
probability estimates. It is commonly used in (multinomial) logistic regression and 
neural networks, as well as in some variants of expectation-maximization, and can 
be used to evaluate the probability outputs (predict_proba) of a classifier
instead of its discrete predictions. 


For binary classification with a true label $y \in \{0,1\}$ and a probability estimate
$p = \operatorname{Pr}(y = 1)$, the log loss per sample is the negative log-likelihood of the
classifier given the true label:

$$L_{\log}(y, p) = -\log \operatorname{Pr}(y|p) = -(y \log (p) + (1 - y) \log (1 - p))$$

This extends to the multiclass case as follows. Let the true labels for a set of
samples be encoded as a 1-of-K binary indicator matrix $Y$, i.e., $y_{i,k} = 1$ if sample
$i$ has label $k$ taken from a set of $K$ labels. Let $P$ be a matrix of probability
estimates, with $p_{i,k} = \operatorname{Pr}(t_{i,k} = 1)$. Then the log loss of the whole set is

$$L_{\log}(Y, P) = -\log \operatorname{Pr}(Y|P) = - \frac{1}{N} \sum_{i=0}^{N-1} \sum_{k=0}^{K-1} y_{i,k} \log p_{i,k}$$

To see how this generalizes the binary log loss given above, note that in the
binary case, $p_{i,0} = 1 - p_{i,1}$ and $y_{i,0} = 1 - y_{i,1}$, so expanding the inner
sum over $y_{i,k} \in \{0,1\}$ gives the binary log loss.


The log_loss function computes log loss given a list of ground-truth labels and a
probability matrix, as returned by an estimator's predict_proba method.


from sklearn.metrics import log_loss
y_true = [0, 0, 1, 1]
y_pred = [[.9, .1], [.8, .2], [.3, .7], [.01, .99]]
log_loss(y_true, y_pred)


0.17380733669106749
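
The binary log loss can also be computed by hand, which makes the per-sample formula concrete. The sketch below is an illustration only: `binary_log_loss` is a hypothetical helper, and the probabilities are the class-1 estimates for four samples with true labels [0, 0, 1, 1], chosen so that the average comes out at approximately 0.1738.

```python
import math


def binary_log_loss(y_true, p_pos):
    """Average of -(y*log(p) + (1-y)*log(1-p)) over all samples."""
    total = 0.0
    for y, p in zip(y_true, p_pos):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


y_true = [0, 0, 1, 1]
p_pos = [0.1, 0.2, 0.7, 0.99]   # estimated probability of class 1 per sample
print(binary_log_loss(y_true, p_pos))
```

Confident, correct predictions (such as 0.99 for a true positive) contribute almost nothing, while the loss grows without bound as the probability assigned to the true class approaches zero.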


Hinge loss

The hinge_loss function computes the average distance between the model and
the data using hinge loss, a one-sided metric that considers only prediction errors.
(Hinge loss is used in maximal margin classifiers such as support vector
machines.)


If the labels are encoded with +1 and -1, $y$ is the true value, and $w$ is the
predicted decision as output by decision_function, then the hinge loss is defined
as:

$$L_\text{Hinge}(y, w) = \max\{1 - wy, 0\} = |1 - wy|_+$$

If there are more than two labels, hinge_loss uses a multiclass variant due to
Crammer & Singer. If $y_w$ is the predicted decision for the true label and $y_t$ is the
maximum of the predicted decisions for all other labels, where predicted decisions
are output by decision_function, then multiclass hinge loss is defined by:

$$L_\text{Hinge}(y_w, y_t) = \max\{1 + y_t - y_w, 0\}$$
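
For the binary case, the average hinge loss is easy to reproduce by hand. The following sketch assumes labels encoded as +1 and -1; `binary_hinge_loss` and the decision values are hypothetical and only meant to illustrate the formula.

```python
def binary_hinge_loss(y_true, decisions):
    """Average of max(1 - y*w, 0) over samples, with labels in {-1, +1}."""
    losses = [max(1.0 - y * w, 0.0) for y, w in zip(y_true, decisions)]
    return sum(losses) / len(losses)


# Two samples lie beyond the margin (loss 0); the third is on the correct
# side of the boundary but inside the margin, so it still incurs a loss.
print(binary_hinge_loss([-1, 1, 1], [-2.18, 2.36, 0.09]))
```

Only samples that are misclassified or that fall inside the margin contribute, which is the "one-sided" behaviour described above.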


from sklearn import svm
from sklearn.metrics import hinge_loss
X = [[0], [1]]
y = [-1, 1]
est = svm.LinearSVC(random_state=0)
est.fit(X, y)
pred_decision = est.decision_function([[-2], [3], [0.5]])


print(pred_decision)
print(hinge_loss([-1, 1, 1], pred_decision))


[-2.18173682  2.36360149  0.09093234]
0.303022554204


X = np.array([[0], [1], [2], [3]])
Y = np.array([0, 1, 2, 3])
labels = np.array([0, 1, 2, 3])
est = svm.LinearSVC()
est.fit(X, Y)
pred_decision = est.decision_function([[-1], [2], [3]])
y_true = [0, 2, 3]
hinge_loss(y_true, pred_decision, labels)


0.56412359941917456


Exercise: [...] observe how the coefficients change and how the final prediction performance is affected.


Sections


• Decision trees learning
    ◦ Building a decision tree
    ◦ Visualize a decision tree
    ◦ Different impurity criteria
    ◦ Implement a binary decision tree in python
• Combining weak to strong learners via random forests
• Learning with ensembles
    ◦ Majority vote classifier
        ▪ VotingClassifier in Sklearn
        ▪ Combining different algorithms for classification with majority vote
        ▪ Evaluating the ensemble classifier
    ◦ Bagging -- Building an ensemble of classifiers from bootstrap samples
    ◦ Leveraging weak learners via adaptive boosting
• Algorithm implementation


Decision trees learning 
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Here we'll explore a class of algorithms based on decision trees. Decision trees at 
their root are extremely intuitive. They encode a series of "if" and "else" choices, 
similar to how a person might make a decision. However, which questions to ask, 
and how to proceed for each answer is entirely learned from the data. 


For example, if you wanted to create a guide to identifying an animal found in 
nature, you might ask the following series of questions: 


• Is the animal bigger or smaller than a meter long?
    ◦ bigger: does the animal have horns?
        ▪ yes: are the horns longer than ten centimeters?
        ▪ no: is the animal wearing a collar?
    ◦ smaller: does the animal have two or four legs?
        ▪ two: does the animal have wings?
        ▪ four: does the animal have a bushy tail?


and so on. This binary splitting of questions is the essence of a decision tree. 


One of the main benefits of tree-based models is that they require little
preprocessing of the data. They can work with variables of different types 
(continuous and discrete) and are invariant to scaling of the features. 


Another benefit is that tree-based models are what is called "nonparametric", 
which means they don't have a fixed set of parameters to learn. Instead, a tree
model can become more and more flexible, if given more data. In other words, the 
number of free parameters grows with the number of samples and is not fixed, as 
for example in linear models. 


Building a decision tree 
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import numpy as np 

from sklearn import datasets 

from sklearn.cross_validation import train_test_split 
from sklearn.preprocessing import StandardScaler 


iris = datasets.load_iris()
X = iris.data[:, [2, 3]]
y = iris.target


X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)


sc = StandardScaler()
sc.fit(X_train)
X_train_std = sc.transform(X_train)  # standardize
X_test_std = sc.transform(X_test)


from matplotlib.colors import ListedColormap
import matplotlib.pyplot as plt
%matplotlib inline


def plot_decision_regions(X, y, classifier, test_idx=None, resolution=0.02):

    # setup marker generator and color map
    markers = ('s', 'x', 'o', '^', 'v')
    colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
    cmap = ListedColormap(colors[:len(np.unique(y))])

    # plot the decision surface
    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution),
                           np.arange(x2_min, x2_max, resolution))
    Z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
    Z = Z.reshape(xx1.shape)
    plt.contourf(xx1, xx2, Z, alpha=0.4, cmap=cmap)
    plt.xlim(xx1.min(), xx1.max())
    plt.ylim(xx2.min(), xx2.max())

    # plot all samples
    for idx, cl in enumerate(np.unique(y)):
        plt.scatter(x=X[y == cl, 0], y=X[y == cl, 1],
                    alpha=0.8, c=cmap(idx),
                    marker=markers[idx], label=cl)

    # highlight test samples with hollow circles
    # (facecolors='none' replaces c='', which newer matplotlib rejects)
    if test_idx:
        X_test, y_test = X[test_idx, :], y[test_idx]
        plt.scatter(X_test[:, 0], X_test[:, 1],
                    facecolors='none', edgecolors='black',
                    alpha=1.0, linewidth=1, marker='o',
                    s=55, label='test set')


Decision trees and ensemble learning


from sklearn.tree import DecisionTreeClassifier
# max depth 3, using entropy as the impurity criterion
tree = DecisionTreeClassifier(criterion='entropy', max_depth=3,
                              random_state=0)
tree.fit(X_train, y_train)


X_combined = np.vstack((X_train, X_test))
y_combined = np.hstack((y_train, y_test))
plot_decision_regions(X_combined, y_combined,
                      classifier=tree, test_idx=range(105, 150))


plt.xlabel('petal length [cm]')
plt.ylabel('petal width [cm]')
plt.legend(loc='upper left')
plt.tight_layout()
# plt.savefig('./figures/decision_tree_decision.png', dpi=300)


[Figure: decision tree decision regions; x-axis: petal length [cm], y-axis: petal width [cm]]


Visualize a decision tree 
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from sklearn.tree import export_graphviz 
# export the tree as a .dot file; install GraphViz to convert the format
export_graphviz(tree,
                out_file='tree.dot',
                feature_names=['petal length', 'petal width'])


# pip install pydotplus
import pydotplus
from sklearn.externals.six import StringIO
from IPython.display import Image
dot_data = StringIO()
export_graphviz(tree, out_file=dot_data,
                feature_names=['petal length', 'petal width'],
                class_names=iris.target_names,
                filled=True, rounded=True,
                special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())


[Figure: rendered decision tree. Root node: petal width ≤ 0.75, entropy = 1.5799, samples = 105, value = [34, 32, 39], class = virginica. On its False branch: petal length ≤ 4.95, entropy = 0.993, samples = 71, value = [0, 32, 39], class = virginica.]


Different impurity criteria


def gini(p):
    return (p)*(1 - (p)) + (1 - p)*(1 - (1 - p))


def entropy(p):
    return - p*np.log2(p) - (1 - p)*np.log2((1 - p))


def error(p):
    return 1 - np.max([p, 1 - p])


x = np.arange(0.0, 1.0, 0.01)


ent = [entropy(p) if p != 0 else None for p in x]
sc_ent = [e*0.5 if e else None for e in ent]
err = [error(i) for i in x]


fig = plt.figure()
ax = plt.subplot(111)
for i, lab, ls, c, in zip([ent, sc_ent, gini(x), err],
                          ['Entropy', 'Entropy (scaled)',
                           'Gini Impurity', 'Misclassification Error'],
                          ['-', '-', '--', '-.'],
                          ['black', 'lightgray', 'red', 'green', 'cyan']):
    line = ax.plot(x, i, label=lab, linestyle=ls, lw=2, color=c)


ax.legend(loc='upper center', bbox_to_anchor=(0.5, 1.15),
          ncol=3, fancybox=True, shadow=False)


ax.axhline(y=0.5, linewidth=1, color='k', linestyle='--')
ax.axhline(y=1.0, linewidth=1, color='k', linestyle='--')
plt.ylim([0, 1.1])
plt.xlabel('p(i=1)')
plt.ylabel('Impurity Index')
plt.tight_layout()
# plt.savefig('./figures/impurity.png', dpi=300, bbox_inches='tight')


[Figure: impurity index as a function of p(i=1) for Entropy, Entropy (scaled), Gini Impurity and Misclassification Error]


Implement a binary decision tree in python


from collections import Counter
import numpy as np


# Build a class to represent the structure of a binary tree.
# Besides the left and right child Trees, each Tree node stores the labels of
# the data it contains and the index of the feature it splits on.
class Tree:
    """ Binary Tree. """

    def __init__(self, labels, split_idx=None,
                 children_left=None, children_right=None):
        self.children_left = children_left
        self.children_right = children_right
        self.labels = labels
        self.split_idx = split_idx
        self.entropy = calc_entropy(self.labels)

    def predict(self):
        # find most frequent element
        most_freq = np.bincount(self.labels).argmax()
        return most_freq

    def __repr__(self, level=0):
        """ make it easy to visualize a tree"""
        prefix = "\t" * level
        string = prefix + "entropy = {}, labels = {}, [0s, 1s] = {}\n".format(
            self.entropy, self.labels, np.bincount(self.labels, minlength=2))
        if self.split_idx is not None:
            string += prefix + "split on Column {}\n".format(self.split_idx)
            string += prefix + "True:\n"
            string += self.children_left.__repr__(level+1)
            string += prefix + "False:\n"
            string += self.children_right.__repr__(level+1)
        return string


# compute the entropy of a set of labels
def calc_entropy(labels):
    """ calculate entropy from an array of labels"""
    size = float(len(labels))
    cnt = Counter(labels)
    entropy = 0
    for label in set(labels):
        prob = cnt[label] / size
        entropy += -1 * prob * np.log2(prob)
    return entropy


# Different decision tree algorithms (such as ID3, C4.5 and CART) use different
# criteria to choose the feature to split on.
# Here we use information gain: the reduction in entropy obtained by the split.
def choose_best_feature_to_split(features, labels):
    """ choose the best split feature which maximize information gain """
    num_features = features.shape[1]
    base_entropy = calc_entropy(labels)

    best_info_gain = 0
    best_feature = None

    for i in range(num_features):
        new_entropy = 0
        for value in [0, 1]:
            new_labels = labels[features[:, i] == value]
            weight = float(len(new_labels)) / len(labels)
            new_entropy += weight * calc_entropy(new_labels)
        info_gain = base_entropy - new_entropy
        if info_gain > best_info_gain:
            best_info_gain = info_gain
            best_feature = i
    return best_feature


def create_decision_tree(features, labels,
                         current_depth=0, max_depth=10):
    """ recursively create tree """
    tree = Tree(labels)

    # define stop condition
    # stop when all data in this node are from the same class
    if len(set(labels)) == 1:
        return tree
    # stop when max depth is reached
    if current_depth >= max_depth:
        return tree

    # split on the best feature found
    best_feature = choose_best_feature_to_split(features, labels)
    if best_feature is None:
        return tree

    # recursively build subtrees
    msk = (features[:, best_feature] == 1)
    tree.split_idx = best_feature
    # pass max_depth down so the depth limit applies to the whole tree;
    # ~msk is the boolean negation (-msk is not valid for boolean arrays)
    tree.children_left = create_decision_tree(
        features[msk], labels[msk], current_depth+1, max_depth)
    tree.children_right = create_decision_tree(
        features[~msk], labels[~msk], current_depth+1, max_depth)

    return tree


data = np.array([[1, 0],
                 [1, 1],
                 [0, 1],
                 [0, 0]])
labels = np.array([1, 0, 0, 0])


tree = create_decision_tree(data, labels)
tree


entropy = 0.811278124459, labels = [1 0 0 0], [0s, 1s] = [3 1]
split on Column 0
True:
	entropy = 1.0, labels = [1 0], [0s, 1s] = [1 1]
	split on Column 1
	True:
		entropy = 0.0, labels = [0], [0s, 1s] = [1 0]
	False:
		entropy = 0.0, labels = [1], [0s, 1s] = [0 1]
False:
	entropy = 0.0, labels = [0 0], [0s, 1s] = [2 0]


def _classify(tree, data_row):
    """ prediction for new data"""
    if tree.split_idx is None:
        return tree.predict()

    split_idx = tree.split_idx
    if data_row[split_idx]:
        return _classify(tree.children_left, data_row)
    else:
        return _classify(tree.children_right, data_row)


def classify(tree, data):
    data = np.array(data)
    num_row = data.shape[0]
    results = np.empty(shape=num_row)
    for i in range(num_row):
        results[i] = _classify(tree, data[i, :])
    return results


new_data = [[0, 0],
            [1, 0]]


classify(tree, new_data)


array([ 0.,  1.])


Combining weak to strong learners via random forests

A random forest can be viewed as an ensemble of decision trees that combines weak models into a strong one. It generalizes better and is less prone to overfitting.


1. Draw a random bootstrap sample of size n (with replacement).
2. Grow a decision tree from the bootstrap sample. At each node:
    ◦ randomly select d features without replacement
    ◦ split the node using the feature that provides the best split
3. Repeat steps 1 & 2 k times.
4. Aggregate the prediction by each tree to assign the class label by majority
vote.
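
The sampling and voting pieces of this procedure can be sketched with the standard library alone. This is not the scikit-learn implementation: `bootstrap_sample`, `random_feature_subset` and `majority_vote` are hypothetical helpers, and growing the individual trees is left out.

```python
import random
from collections import Counter


def bootstrap_sample(n, rng):
    """Indices of a bootstrap sample of size n, drawn with replacement (step 1)."""
    return [rng.randrange(n) for _ in range(n)]


def random_feature_subset(n_features, d, rng):
    """d distinct feature indices, drawn without replacement (step 2)."""
    return rng.sample(range(n_features), d)


def majority_vote(predictions):
    """Most common label among the per-tree predictions (step 4)."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)
idx = bootstrap_sample(10, rng)           # some indices repeat, others are left out
feats = random_feature_subset(4, 2, rng)  # features considered at one node
print(idx, feats)
print(majority_vote([1, 0, 1, 1, 2]))     # hypothetical predictions from five trees
```

Because each tree sees a different bootstrap sample and a different feature subset at each node, the trees make partly independent errors, which is what the majority vote averages out.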


from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(criterion='entropy',
                                n_estimators=10,
                                random_state=1,
                                n_jobs=2)
forest.fit(X_train, y_train)


plot_decision_regions(X_combined, y_combined,
                      classifier=forest, test_idx=range(105,150))


plt.xlabel('petal length [cm]')
plt.ylabel('petal width [cm]')
plt.legend(loc='upper left')
plt.tight_layout()


[Figure: decision regions of the random forest; x-axis: petal length [cm], y-axis: petal width [cm]]


Learning with ensembles 


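Combine a set of classifiers and take the majority as the classification result.
Build powerful models from weak learners that learn from their mistakes


An ensemble method combines several different classifiers into one larger classifier. The most common approach is majority voting.


Even if each individual classifier has a fairly high error rate, the error rate of the combined classifier can be much lower.


Suppose we combine n classifiers that all have the same error rate ε and are independent of each other. The probability that at least k of the n classifiers are wrong is

$$P(y \ge k) = \sum_{j=k}^{n} \binom{n}{j} \varepsilon^{j} (1 - \varepsilon)^{n-j}$$

For a majority vote the ensemble is wrong when k = ⌈n/2⌉ or more of the classifiers are wrong.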
# scipy.misc.comb was removed from SciPy; scipy.special.comb is the same function
from scipy.special import comb
import math


# ensemble error rate
def ensemble_error(n_classifier, error):
    k_start = int(math.ceil(n_classifier / 2.0))
    probs = [comb(n_classifier, k) * error**k *
             (1-error)**(n_classifier - k)
             for k in range(k_start, n_classifier + 1)]
    return sum(probs)


# with 11 classifiers, each with an error rate of 0.25,
# the error rate of the combined ensemble is
ensemble_error(n_classifier=11, error=0.25)


0.034327507019042969
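As a cross-check of this number, the same sum can be evaluated exactly with rational arithmetic. This is a stdlib-only sketch (Python 3.8+ for `math.comb`); the name `ensemble_error_exact` is an assumption made for this illustration.

```python
from fractions import Fraction
from math import comb  # Python 3.8+


def ensemble_error_exact(n_classifier, error):
    """Probability that at least ceil(n / 2) of n independent classifiers are wrong."""
    k_start = -(-n_classifier // 2)  # integer ceil(n / 2)
    return sum(comb(n_classifier, k) * error**k * (1 - error)**(n_classifier - k)
               for k in range(k_start, n_classifier + 1))


# 11 classifiers, each wrong with probability 1/4
exact = ensemble_error_exact(11, Fraction(1, 4))
approx = float(exact)
```

With an error rate of 1/4 every term has the denominator 4^11, so the result is the fraction 143980/4194304 = 35995/1048576, which agrees with the floating-point value above.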


# relationship between ensemble error and base error
error_range = np.arange(0.0, 1.01, 0.01)
ens_errors = [ensemble_error(n_classifier=11, error=error)
              for error in error_range]

plt.plot(error_range, ens_errors,
         label='Ensemble error', linewidth=2)

plt.plot(error_range, error_range,
         linestyle='--', label='Base error', linewidth=2)

plt.xlabel('Base error')
plt.ylabel('Base/Ensemble error')
plt.legend(loc='upper left')
plt.grid()
plt.tight_layout()
# plt.savefig('./figures/ensemble_err.png', dpi=300)


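[Figure: ensemble error and base error plotted against the base error; y-axis: Base/Ensemble error]


When ε < 0.5 the ensemble error is always lower than the base error; when ε > 0.5 the ensemble error is higher than the base error.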


Majority vote classifier 


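Combine different classification algorithms, each associated with an individual weight for confidence.


When the classifiers C_j all carry the same weight, the prediction ŷ is:

$$\hat{y} = \text{mode}\{C_1(x), C_2(x), \ldots, C_m(x)\}$$

When each classifier C_j has its own weight w_j, then

$$\hat{y} = \arg\max_i \sum_{j=1}^{m} w_j P_j(i)$$

where P_j(i) is classifier C_j's vote for class i (1 if C_j predicts i and 0 otherwise, or the predicted probability of class i in soft voting).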


import numpy as np
np.argmax(np.bincount([0, 0, 1],
                      weights=[0.2, 0.2, 0.6]))


# np.argmax: returns the indices of the maximum values along an axis.
# np.bincount: counts the number of occurrences of each value in an array of non-negative ints


np.bincount([0, 0, 1],
            weights=[0.2, 0.2, 0.6])


array([ 0.4,  0.6])


ex = np.array([[0.9, 0.1],   # predictions of C1
               [0.8, 0.2],   # predictions of C2
               [0.4, 0.6]])  # predictions of C3


p = np.average(ex,
               axis=0,
               weights=[0.2, 0.2, 0.6])  # weights of C1, C2, C3
p


array([ 0.58,  0.42])


$$p(i_0 \mid x) = 0.58, \quad p(i_1 \mid x) = 0.42, \quad \hat{y} = \arg\max_i \left[ p(i_0 \mid x),\ p(i_1 \mid x) \right]$$


np.argmax(p)


VotingClassifier in Sklearn 


Using the VotingClassifier provided by Sklearn

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier


# three base classifiers
clf1 = LogisticRegression(random_state=1)
clf2 = RandomForestClassifier(random_state=1)
clf3 = GaussianNB()


# toy data
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
y = np.array([1, 1, 1, 2, 2, 2])


# voting='hard', uses predicted class labels for majority rule voting.
eclf1 = VotingClassifier(estimators=[
    ('lr', clf1), ('rf', clf2), ('gnb', clf3)], voting='hard')
eclf1 = eclf1.fit(X, y)
print(eclf1.predict(X))


# voting='soft', predicts the class label based on the argmax of
# the sums of the predicted probabilities
eclf2 = VotingClassifier(estimators=[
    ('lr', clf1), ('rf', clf2), ('gnb', clf3)],
    voting='soft')
eclf2 = eclf2.fit(X, y)
print(eclf2.predict(X))


# add weight
eclf3 = VotingClassifier(estimators=[
    ('lr', clf1), ('rf', clf2), ('gnb', clf3)],
    voting='soft', weights=[2,1,1])
eclf3 = eclf3.fit(X, y)
print(eclf3.predict(X))


[1 1 1 2 2 2]
[1 1 1 2 2 2]
[1 1 1 2 2 2]


Combining different algorithms for classification with majority vote

from sklearn import datasets
# sklearn.cross_validation in older scikit-learn releases
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder


iris = datasets.load_iris()
X, y = iris.data[50:, [1, 2]], iris.target[50:]


st = StandardScaler()
X = st.fit_transform(X)


le = LabelEncoder()
y = le.fit_transform(y)


X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=0.5, random_state=1)


from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score


clf1 = LogisticRegression(C=0.01, random_state=42)
clf2 = KNeighborsClassifier(n_neighbors=1)
clf3 = DecisionTreeClassifier(max_depth=1, random_state=42)


clf_labels = ['Logistic Regression', 'KNN', 'Decision Tree']
all_clf = [clf1, clf2, clf3]


print('10-fold cross validation:\n')
for clf, label in zip(all_clf, clf_labels):
    scores = cross_val_score(estimator=clf, X=X_train, y=y_train,
                             cv=10, scoring='roc_auc')
    print("ROC AUC: %0.2f (+/- %0.2f) [%s]"
          % (scores.mean(), scores.std(), label))


10-fold cross validation: 


ROC AUC: 0.93 (+/- 0.15) [Logistic Regression] 
ROC AUC: 0.93 (+/- 0.10) [KNN] 
ROC AUC: 0.92 (+/- 0.15) [Decision Tree] 


from sklearn.ensemble import VotingClassifier


mv_clf = VotingClassifier(
    estimators=[('c1', clf1), ('c2', clf2), ('c3', clf3)], voting='soft')


clf_labels += ['Majority Voting']
all_clf += [mv_clf]


print('10-fold cross validation:\n')
for clf, label in zip(all_clf, clf_labels):
    scores = cross_val_score(estimator=clf, X=X_train, y=y_train,
                             cv=10, scoring='roc_auc')
    print("ROC AUC: %0.2f (+/- %0.2f) [%s]"
          % (scores.mean(), scores.std(), label))


10-fold cross validation:

ROC AUC: 0.93 (+/- 0.15) [Logistic Regression]
ROC AUC: 0.93 (+/- 0.10) [KNN]
ROC AUC: 0.92 (+/- 0.15) [Decision Tree]
ROC AUC: 0.97 (+/- 0.10) [Majority Voting]


The last entry is majority voting, which scores clearly better than any of the individual classifiers.


Evaluating the ensemble classifier 


Evaluate the ROC AUC of each classifier on the test set


from sklearn.metrics import roc_curve
from sklearn.metrics import auc


colors = ['black', 'orange', 'blue', 'green']
linestyles = [':', '--', '-.', '-']
for clf, label, clr, ls \
        in zip(all_clf, clf_labels, colors, linestyles):

    # assuming the label of the positive class is 1
    y_pred = clf.fit(X_train, y_train).predict_proba(X_test)[:, 1]
    fpr, tpr, thresholds = roc_curve(y_true=y_test,
                                     y_score=y_pred)
    roc_auc = auc(x=fpr, y=tpr)
    plt.plot(fpr, tpr,
             color=clr,
             linestyle=ls,
             label='%s (auc = %0.2f)' % (label, roc_auc))


plt.legend(loc='lower right')
plt.plot([0, 1], [0, 1],
         linestyle='--',
         color='gray',
         linewidth=2)


plt.xlim([-0.1, 1.1])
plt.ylim([-0.1, 1.1])
plt.grid()
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')


plt.tight_layout()
# plt.savefig('./figures/roc.png', dpi=300)
[Figure: ROC curves on the test set. Logistic Regression (auc = 0.94), KNN (auc = 0.86), Decision Tree (auc = 0.89), Majority Voting (auc = 0.95); x-axis: False Positive Rate, y-axis: True Positive Rate]


As the ROC curves show, the ensemble classifier performs well on the test set.


Plot the decision regions of each classifier


from itertools import product


x_min = X_train[:, 0].min() - 1
x_max = X_train[:, 0].max() + 1
y_min = X_train[:, 1].min() - 1
y_max = X_train[:, 1].max() + 1


xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
                     np.arange(y_min, y_max, 0.1))


f, axarr = plt.subplots(nrows=2, ncols=2,
                        sharex='col', sharey='row',
                        figsize=(7, 5))


for idx, clf, tt in zip(product([0, 1], [0, 1]),
                        all_clf, clf_labels):
    clf.fit(X_train, y_train)

    Z = clf.predict(np.vstack([xx.ravel(), yy.ravel()]).T)
    Z = Z.reshape(xx.shape)

    axarr[idx[0], idx[1]].contourf(xx, yy, Z, alpha=0.3)
    axarr[idx[0], idx[1]].scatter(X_train[y_train==0, 0],
                                  X_train[y_train==0, 1],
                                  c='blue', marker='^', s=50)
    axarr[idx[0], idx[1]].scatter(X_train[y_train==1, 0],
                                  X_train[y_train==1, 1],
                                  c='red', marker='o', s=50)
    axarr[idx[0], idx[1]].set_title(tt)


plt.text(-3.5, -4.5,
         s='Sepal width [standardized]',
         ha='center', va='center', fontsize=12)
plt.text(-10.5, 4.5,
         s='Petal length [standardized]',
         ha='center', va='center',
         fontsize=12, rotation=90)


plt.tight_layout()
# plt.savefig('./figures/voting_panel', bbox_inches='tight', dpi=300)
Bagging -- Building an ensemble of classifiers from bootstrap samples


• draw bootstrap samples (random samples with replacement) from initial training set
• random forests are a special case of bagging where we also use random feature subsets to fit the individual decision trees
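To make "random samples with replacement" concrete, the sketch below draws one bootstrap sample with the standard library and measures how much of the original data it actually contains. The helper name `bootstrap_sample` is hypothetical; this is an illustration, not the scikit-learn code.

```python
import random


def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return [rng.choice(data) for _ in data]


rng = random.Random(0)
data = list(range(10000))
sample = bootstrap_sample(data, rng)

# share of the original items that appear at least once in the sample
unique_fraction = len(set(sample)) / len(data)
```

For large n the expected share of distinct items is 1 - (1 - 1/n)^n, which approaches 1 - 1/e ≈ 0.632. Each base classifier therefore sees a noticeably different subset of the training data, which is where the diversity of a bagging ensemble comes from.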




import pandas as pd
# wine dataset
df_wine = pd.read_csv('ftp://ftp.ics.uci.edu/pub/machine-learning-databases/wine/wine.data',
                      header=None)


df_wine.columns = ['Class label', 'Alcohol', 'Malic acid', 'Ash',
                   'Alcalinity of ash', 'Magnesium', 'Total phenols',
                   'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins',
                   'Color intensity', 'Hue', 'OD280/OD315 of diluted wines',
                   'Proline']


# only consider Wine classes 2 and 3
df_wine = df_wine[df_wine['Class label'] != 1]


y = df_wine['Class label'].values
X = df_wine[['Alcohol', 'Hue']].values


from sklearn.preprocessing import LabelEncoder
# sklearn.cross_validation in older scikit-learn releases
from sklearn.model_selection import train_test_split


# encode the labels
le = LabelEncoder()
y = le.fit_transform(y)


# 60% train, 40% test
X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=0.40, random_state=1)


# BaggingClassifier in sklearn actually goes beyond plain bagging:
# it can subsample the features as well as the samples
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier


tree = DecisionTreeClassifier(criterion='entropy')


# use a decision tree as the base estimator
# (the parameter is named `estimator` in scikit-learn >= 1.2)
bag = BaggingClassifier(base_estimator=tree,
                        n_estimators=500,
                        max_samples=1.0,           # fraction of samples to draw
                        max_features=1.0,          # fraction of features to draw
                        bootstrap=True,            # whether to bootstrap when sampling samples
                        bootstrap_features=False,  # whether to bootstrap when sampling features
                        random_state=1)














from sklearn.metrics import accuracy_score


tree = tree.fit(X_train, y_train)
y_train_pred = tree.predict(X_train)
y_test_pred = tree.predict(X_test)


tree_train = accuracy_score(y_train, y_train_pred)
tree_test = accuracy_score(y_test, y_test_pred)
print('Decision tree train/test accuracies %.3f/%.3f'
      % (tree_train, tree_test))


bag = bag.fit(X_train, y_train)
y_train_pred = bag.predict(X_train)
y_test_pred = bag.predict(X_test)


bag_train = accuracy_score(y_train, y_train_pred)
bag_test = accuracy_score(y_test, y_test_pred)
print('Bagging train/test accuracies %.3f/%.3f'
      % (bag_train, bag_test))


Decision tree train/test accuracies 1.000/0.854
Bagging train/test accuracies 1.000/0.896


With bagging, the accuracy on the test set improves.


%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt


x_min = X_train[:, 0].min() - 1
x_max = X_train[:, 0].max() + 1
y_min = X_train[:, 1].min() - 1
y_max = X_train[:, 1].max() + 1


xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
                     np.arange(y_min, y_max, 0.1))


f, axarr = plt.subplots(nrows=1, ncols=2,
                        sharex='col',
                        sharey='row',
                        figsize=(8, 3))


for idx, clf, tt in zip([0, 1],
                        [tree, bag],
                        ['Decision Tree', 'Bagging']):
    clf.fit(X_train, y_train)

    Z = clf.predict(np.vstack([xx.ravel(), yy.ravel()]).T)
    Z = Z.reshape(xx.shape)

    axarr[idx].contourf(xx, yy, Z, alpha=0.3)
    axarr[idx].scatter(X_train[y_train==0, 0],
                       X_train[y_train==0, 1],
                       c='blue', marker='^')
    axarr[idx].scatter(X_train[y_train==1, 0],
                       X_train[y_train==1, 1],
                       c='red', marker='o')
    axarr[idx].set_title(tt)


axarr[0].set_ylabel('Alcohol', fontsize=12)
plt.text(10.2, -1.2, s='Hue',
         ha='center', va='center', fontsize=12)


plt.tight_layout()
# plt.savefig('./figures/bagging_region.png',
#             dpi=300,
#             bbox_inches='tight')


[Figure: decision regions of the Decision Tree (left panel) and Bagging (right panel)]


Bagging reduces overfitting and makes the decision boundary smoother.


Leveraging of weak learners via adaptive boosting


let the weak learners subsequently learn from misclassified training samples to improve the performance of the ensemble


Adaptive boosting (AdaBoost) 


1. Set weight vector $\mathbf{w}$ to uniform weights where $\sum_i w_i = 1$.
2. For $j$ in $m$ boosting rounds, do the following:
    1. Train a weighted weak learner: $C_j = \text{train}(X, y, \mathbf{w})$
    2. Predict class labels: $\hat{y} = \text{predict}(C_j, X)$
    3. Compute weighted error rate: $\varepsilon = \mathbf{w} \cdot (\hat{y} \ne y)$
    4. Compute coefficient: $\alpha_j = 0.5 \ln \frac{1 - \varepsilon}{\varepsilon}$
    5. Update weights: $\mathbf{w} := \mathbf{w} \times \exp(-\alpha_j \times \hat{y} \times y)$
    6. Normalize weights to sum to 1: $\mathbf{w} := \mathbf{w} / \sum_i w_i$
3. Compute final prediction: $\hat{y} = \left( \sum_{j=1}^{m} \alpha_j \times \text{predict}(C_j, X) > 0 \right)$

$\cdot$ denotes dot product between two vectors
$\times$ denotes element-wise multiplication of two vectors
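One boosting round of this procedure can be traced by hand. The sketch below is a stdlib-only illustration under assumed inputs: ten samples with labels in {-1, +1} and a hypothetical weak learner that misclassifies three of them. It is not the scikit-learn implementation.

```python
from math import exp, log

# Ten training samples; the hypothetical weak learner gets indices 6, 7, 8 wrong.
y_true = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
y_pred = [1, 1, 1, -1, -1, -1, -1, -1, -1, -1]

n = len(y_true)
w = [1.0 / n] * n                       # uniform initial weights, summing to 1

# weighted error rate: total weight of the misclassified samples
eps = sum(wi for wi, t, p in zip(w, y_true, y_pred) if t != p)

# coefficient of this weak learner
alpha = 0.5 * log((1 - eps) / eps)

# correct predictions (t * p = +1) shrink, wrong ones (t * p = -1) grow
w = [wi * exp(-alpha * t * p) for wi, t, p in zip(w, y_true, y_pred)]

# normalize so the weights sum to 1 again
total = sum(w)
w = [wi / total for wi in w]

wrong_weight = sum(wi for wi, t, p in zip(w, y_true, y_pred) if t != p)
```

With an error rate of 0.3 the coefficient is about 0.424; after normalization each correctly classified sample weighs 1/14 and each misclassified one 1/6, so the three misclassified samples together carry half of the total weight in the next round.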


from sklearn.ensemble import AdaBoostClassifier


tree = DecisionTreeClassifier(criterion='entropy',
                              max_depth=1)


# the parameter is named `estimator` in scikit-learn >= 1.2
ada = AdaBoostClassifier(base_estimator=tree,
                         n_estimators=500,
                         learning_rate=0.1,
                         random_state=0)


tree = tree.fit(X_train, y_train)
y_train_pred = tree.predict(X_train)
y_test_pred = tree.predict(X_test)


tree_train = accuracy_score(y_train, y_train_pred)
tree_test = accuracy_score(y_test, y_test_pred)
print('Decision tree train/test accuracies %.3f/%.3f'
      % (tree_train, tree_test))


ada = ada.fit(X_train, y_train)
y_train_pred = ada.predict(X_train)
y_test_pred = ada.predict(X_test)


ada_train = accuracy_score(y_train, y_train_pred)
ada_test = accuracy_score(y_test, y_test_pred)
print('AdaBoost train/test accuracies %.3f/%.3f'
      % (ada_train, ada_test))


Decision tree train/test accuracies 0.845/0.854
AdaBoost train/test accuracies 1.000/0.875


AdaBoost can reduce bias, but it may also introduce additional variance.


x_min, x_max = X_train[:, 0].min() - 1, X_train[:, 0].max() + 1
y_min, y_max = X_train[:, 1].min() - 1, X_train[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
                     np.arange(y_min, y_max, 0.1))


f, axarr = plt.subplots(1, 2, sharex='col', sharey='row', figsize=(8, 3))


for idx, clf, tt in zip([0, 1],
                        [tree, ada],
                        ['Decision Tree', 'AdaBoost']):
    clf.fit(X_train, y_train)

    Z = clf.predict(np.vstack([xx.ravel(), yy.ravel()]).T)
    Z = Z.reshape(xx.shape)

    axarr[idx].contourf(xx, yy, Z, alpha=0.3)
    axarr[idx].scatter(X_train[y_train==0, 0],
                       X_train[y_train==0, 1],
                       c='blue', marker='^')
    axarr[idx].scatter(X_train[y_train==1, 0],
                       X_train[y_train==1, 1],
                       c='red', marker='o')
    axarr[idx].set_title(tt)


axarr[0].set_ylabel('Alcohol', fontsize=12)
plt.text(10.2, -1.2, s='Hue',
         ha='center', va='center', fontsize=12)


plt.tight_layout()


[Figure: decision regions of the Decision Tree (left panel) and AdaBoost (right panel)]


AdaBoost's decision boundary is more complex than that of the single tree, and similar to that of the BaggingClassifier.


Ensemble methods require more computational resources, which also has to be taken into account in practice.


Algorithm Implementation 


Generalized boosted trees: the algorithm in detail


import numpy
import matplotlib.pyplot as plot
%matplotlib inline
from sklearn import tree
from sklearn.tree import DecisionTreeRegressor
from math import floor
import random


n_points = 1000


x_plot = [(float(i) / float(n_points) - 0.5) for i in range(n_points + 1)]


# x needs to be list of lists.
x = [[s] for s in x_plot]


# y (labels) has random noise added to x-value
# set seed
numpy.random.seed(1)
y = [s + numpy.random.normal(scale=0.1) for s in x_plot]


# take fixed test set 30% of sample
n_sample = int(n_points * 0.30)
idx_test = random.sample(range(n_points), n_sample)
idx_test.sort()
idx_train = [idx for idx in range(n_points) if not (idx in idx_test)]


# Define test and training attribute and label sets
x_train = [x[r] for r in idx_train]
x_test = [x[r] for r in idx_test]
y_train = [y[r] for r in idx_train]
y_test = [y[r] for r in idx_test]


# train a series of models on random subsets of the training data
# collect the models in a list and check error of composite as list grows


# maximum number of models to generate
num_trees_max = 30


# tree depth - typically at the high end
tree_depth = 5


# initialize a list to hold models
model_list = []
pred_list = []
eps = 0.3


# initialize residuals to be the labels y
residuals = list(y_train)


for i_trees in range(num_trees_max):
    # fit the newest tree to the current residuals
    model_list.append(DecisionTreeRegressor(max_depth=tree_depth))
    model_list[-1].fit(x_train, residuals)

    # use the new model to update the residuals
    latest_in_sample_prediction = model_list[-1].predict(x_train)
    residuals = [residuals[i] - eps * latest_in_sample_prediction[i]
                 for i in range(len(residuals))]

    latest_out_sample_prediction = model_list[-1].predict(x_test)
    pred_list.append(list(latest_out_sample_prediction))


# build cumulative prediction from the first "n" models
mse = []
all_predictions = []
for i_models in range(len(model_list)):

    # add the predictions of the first i_models + 1 trees and multiply by eps
    prediction = []
    for i_pred in range(len(x_test)):
        prediction.append(
            sum([pred_list[i][i_pred] for i in range(i_models + 1)]) * eps)

    all_predictions.append(prediction)
    errors = [(y_test[i] - prediction[i]) for i in range(len(y_test))]
    mse.append(sum([e * e for e in errors]) / len(y_test))


n_models = [i + 1 for i in range(len(model_list))]


plot.plot(n_models, mse)
plot.axis('tight')
plot.xlabel('Number of Models in Ensemble')
plot.ylabel('Mean Squared Error')
plot.ylim((0.0, max(mse)))
plot.show()


plot_list = [0, 14, 29]
line_type = [':', '-.', '--']
plot.figure()
for i in range(len(plot_list)):
    i_plot = plot_list[i]
    text_legend = 'Prediction with ' + str(i_plot) + ' Trees'
    plot.plot(x_test, all_predictions[i_plot], label=text_legend,
              linestyle=line_type[i])
plot.plot(x_test, y_test, label='True y Value', alpha=0.25)
plot.legend(bbox_to_anchor=(1, 0.3))
plot.axis('tight')
plot.xlabel('x value')
plot.ylabel('Predictions');


[Figure: Mean Squared Error on the test set versus Number of Models in Ensemble (1 to 30)]


[Figure: Prediction with 0 Trees, Prediction with 14 Trees and Prediction with 29 Trees compared with the True y Value; x-axis: x value, y-axis: Predictions]





Random forests: the algorithm in detail


import urllib2  # Python 2; in Python 3 use urllib.request.urlopen and decode each line
import numpy
from sklearn import tree
from sklearn.tree import DecisionTreeRegressor
import random
from math import sqrt
import matplotlib.pyplot as plot


# read data into iterable
target_url = "ftp://ftp.ics.uci.edu/pub/machine-learning-databases/wine-quality/winequality-red.csv"
data = urllib2.urlopen(target_url)


x_list = []
labels = []
names = []
first_line = True
for line in data:
    if first_line:
        names = line.strip().split(";")
        first_line = False
    else:
        # split on semi-colon
        row = line.strip().split(";")
        # put labels in separate array
        labels.append(float(row[-1]))
        # remove label from row
        row.pop()
        # convert row to floats
        float_row = [float(num) for num in row]
        x_list.append(float_row)


nrows = len(x_list)
ncols = len(x_list[0])


# take fixed test set 30% of sample
random.seed(1)  # set seed so results are the same each run
n_sample = int(nrows * 0.30)
idx_test = random.sample(range(nrows), n_sample)
idx_test.sort()
idx_train = [idx for idx in range(nrows) if not (idx in idx_test)]


# Define test and training attribute and label sets
x_train = [x_list[r] for r in idx_train]
x_test = [x_list[r] for r in idx_test]
y_train = [labels[r] for r in idx_train]
y_test = [labels[r] for r in idx_test]


# train a series of models on random subsets of the training data
# collect the models in a list and check error of composite as list grows


# maximum number of models to generate
num_trees_max = 30


# tree depth - typically at the high end
tree_depth = 12


# pick how many attributes will be used in each model.
# authors recommend 1/3 for regression problem
n_attr = 4


model_list = []
index_list = []
pred_list = []
n_train_rows = len(y_train)


for i_trees in range(num_trees_max):

    model_list.append(DecisionTreeRegressor(max_depth=tree_depth))

    # random subset of attributes for this tree
    idx_attr = random.sample(range(ncols), n_attr)
    idx_attr.sort()
    index_list.append(idx_attr)

    # random sample of training rows, drawn with replacement
    idx_rows = []
    for i in range(int(0.5 * n_train_rows)):
        idx_rows.append(random.choice(range(len(x_train))))
    idx_rows.sort()

    # training set restricted to the sampled rows and attributes
    x_rf_train = []
    y_rf_train = []
    for i in range(len(idx_rows)):
        temp = [x_train[idx_rows[i]][j] for j in idx_attr]
        x_rf_train.append(temp)
        y_rf_train.append(y_train[idx_rows[i]])

    model_list[-1].fit(x_rf_train, y_rf_train)

    # restrict the test set to the same attributes
    x_rf_test = []
    for xx in x_test:
        temp = [xx[i] for i in idx_attr]
        x_rf_test.append(temp)

    latest_out_sample_prediction = model_list[-1].predict(x_rf_test)
    pred_list.append(list(latest_out_sample_prediction))


# build cumulative prediction from first "n" models
mse = []
all_predictions = []
for i_models in range(len(model_list)):

    # average the predictions of the first i_models + 1 trees
    prediction = []
    for i_pred in range(len(x_test)):
        prediction.append(
            sum([pred_list[i][i_pred] for i in range(i_models + 1)]) / (
                i_models + 1))

    all_predictions.append(prediction)
    errors = [(y_test[i] - prediction[i]) for i in range(len(y_test))]
    mse.append(sum([e * e for e in errors]) / len(y_test))


n_models = [i + 1 for i in range(len(model_list))]


plot.plot(n_models, mse)
plot.axis('tight')
plot.xlabel('Number of Trees in Ensemble')
plot.ylabel('Mean Squared Error')
plot.ylim((0.0, max(mse)))
plot.show()


print('Minimum MSE')
print(min(mse))


Minimum MSE
0.389088116065
Feature Engineering


Sections


• What is Feature Engineering?
• Data preprocessing
    ◦ Dealing with missing data
        ▪ Eliminating samples or features with missing values
        ▪ Imputing missing values
    ◦ Handling categorical data
        ▪ Mapping ordinal features
        ▪ Encoding class labels
        ▪ Performing one-hot encoding on nominal features
    ◦ Partitioning a dataset in training and test sets
    ◦ Bringing features onto the same scale
• Feature selection
    ◦ Univariate statistics
    ◦ Recursive feature elimination
    ◦ Feature selection using SelectFromModel
        ▪ L1-based feature selection
        ▪ Tree-based feature selection
• Feature extraction
    ◦ Unsupervised dimensionality reduction via principal component analysis
        ▪ Total and explained variance
        ▪ Feature transformation
        ▪ Principal component analysis in scikit-learn
    ◦ Supervised data compression via linear discriminant analysis
        ▪ Computing the scatter matrices
        ▪ Selecting linear discriminants for the new feature subspace
        ▪ Projecting samples onto the new feature space
        ▪ LDA via scikit-learn
    ◦ Using kernel principal component analysis for nonlinear mappings
        ▪ Implementing a kernel principal component analysis in Python
        ▪ Example 1: Separating half-moon shapes
        ▪ Example 2: Separating concentric circles
        ▪ Projecting new data points
        ▪ Kernel principal component analysis in scikit-learn
• Using regularization
    ◦ Ridge Regression
    ◦ LASSO Regression
    ◦ Logistic regression with regularization


What is Feature Engineering? 




Feature engineering is the process of transforming raw data into features that 
better represent the underlying problem to the predictive models, resulting in 
improved model accuracy on unseen data. 


Sub-Problems of Feature Engineering 


• Feature Importance: An estimate of the usefulness of a feature
• Feature Selection: From many features to a few that are useful
• Feature Extraction: The automatic construction of new features from raw data
• Feature Construction: The manual construction of new features from raw data


Iterative Process of Feature Engineering 


• Brainstorm features: Really get into the problem, look at a lot of data, study feature engineering on other problems and see what you can steal.
• Devise features: Depends on your problem, but you may use automatic feature extraction, manual feature construction and mixtures of the two.
• Select features: Use different feature importance scorings and feature selection methods to prepare one or more “views” for your models to operate upon.
• Evaluate models: Estimate model accuracy on unseen data using the chosen features.


General Examples of Feature Engineering 


• Decompose Categorical Attributes
    ◦ Imagine you have a categorical attribute, like "Item Color" that can be Red, Blue or Unknown.
• Decompose a Date-Time
    ◦ A date-time contains a lot of information that can be difficult for a model to take advantage of in its native form, such as ISO 8601 (e.g. 2014-09-20T20:45:40Z).
• Reframe Numerical Quantities
    ◦ Your data is very likely to contain quantities, which can be reframed to better expose relevant structures. This may be a transform into a new unit or the decomposition of a rate into time and amount components.
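As an illustration of decomposing a date-time, the sketch below parses the ISO 8601 timestamp from the example and derives a few simpler features. The feature names in the dictionary are assumptions chosen for this illustration; which components are useful depends on the problem.

```python
from datetime import datetime

stamp = "2014-09-20T20:45:40Z"
dt = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ")

features = {
    "hour_of_day": dt.hour,
    "day_of_week": dt.weekday(),        # Monday = 0 ... Sunday = 6
    "is_weekend": dt.weekday() >= 5,
    "month": dt.month,
}
```

For this timestamp the hour is 20 and the day is a Saturday, so a model can pick up "evening" or "weekend" effects directly instead of having to infer them from the raw string.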


Data preprocessing 


Dealing with missing data 




import numpy as np 
import pandas as pd 


df = pd.DataFrame(np.arange(1, 13).reshape(3, 4), 
columns=['A', 'B', 'C', 'D']) 


df.loc[1, 'C'] = None 
df.loc[2, 'D'] = None 


df 


   A   B     C    D
0  1   2   3.0  4.0
1  5   6   NaN  8.0
2  9  10  11.0  NaN


df.isnull()


       A      B      C      D
0  False  False  False  False
1  False  False   True  False
2  False  False  False   True


df.isnull().sum()


A    0
B    0
C    1
D    1
dtype: int64


Eliminating samples or features with missing values


The simplest way to handle missing values is to drop the rows or columns that contain them.


df.dropna()


   A  B    C    D
0  1  2  3.0  4.0


df.dropna(axis=1)


   A   B
0  1   2
1  5   6
2  9  10


Dropping looks like a convenient approach, but in practice it can throw away a lot of information; a better option is to impute the missing values.


Imputing missing values 
Estimate the missing values and fill them in. The most common approach is mean imputation, i.e. filling each gap with the mean of its column.


# Imputer was replaced by sklearn.impute.SimpleImputer in later scikit-learn releases
from sklearn.preprocessing import Imputer
imr = Imputer(missing_values='NaN', strategy='mean', axis=0)
imr = imr.fit(df.values)


imputed_data = imr.transform(df.values)
imputed_data


array([[  1.,   2.,   3.,   4.],
       [  5.,   6.,   7.,   8.],
       [  9.,  10.,  11.,   6.]])


df.values


array([[  1.,   2.,   3.,   4.],
       [  5.,   6.,  nan,   8.],
       [  9.,  10.,  11.,  nan]])
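What mean imputation does to this array can be reproduced without any library. The following is a stdlib-only sketch on the same 3 x 4 data; the helper name `impute_column_means` is hypothetical and the code only illustrates the column-wise `strategy='mean'` behaviour.

```python
from math import isnan

nan = float("nan")
rows = [[1.0, 2.0, 3.0, 4.0],
        [5.0, 6.0, nan, 8.0],
        [9.0, 10.0, 11.0, nan]]


def impute_column_means(rows):
    """Replace each NaN with the mean of the observed values in its column."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if not isnan(r[j])]
        means.append(sum(observed) / len(observed))
    return [[means[j] if isnan(v) else v for j, v in enumerate(r)]
            for r in rows]


imputed = impute_column_means(rows)
```

The gap in column C becomes (3 + 11) / 2 = 7 and the gap in column D becomes (4 + 8) / 2 = 6; observed values are left unchanged.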


Handling categorical data 


Categorical data needs to be divided into two types, nominal and ordinal: nominal values have no order, whereas ordinal values are ordered.


import pandas as pd
df = pd.DataFrame([
    ['green', 'M', 10.1, 'class1'],
    ['red', 'L', 13.5, 'class2'],
    ['blue', 'XL', 15.3, 'class1']])


df.columns = ['color', 'size', 'price', 'classlabel']
df


   color size  price classlabel
0  green    M   10.1     class1
1    red    L   13.5     class2
2   blue   XL   15.3     class1


• color: nominal feature
• size: ordinal feature, XL > L > M
• price: numerical feature


Mapping ordinal features 


convert the categorical string values into integers 
size_mapping = {
    'XL': 3,
    'L': 2,
    'M': 1}


df['size'] = df['size'].map(size_mapping)


df


   color  size  price classlabel
0  green     1   10.1     class1
1    red     2   13.5     class2
2   blue     3   15.3     class1


inv_size_mapping = {v: k for k, v in size_mapping.items()}
df['size'].map(inv_size_mapping)


0     M
1     L
2    XL
Name: size, dtype: object


Encoding class labels 


Class labels, which are nominal, also need to be converted to a numeric representation. Keep in mind that the numbers here only stand for categories and carry no numerical relationship.

import numpy as np


class_mapping = {label: idx for idx, label in
                 enumerate(np.unique(df['classlabel']))}
class_mapping


{'class1': 0, 'class2': 1}


df['classlabel'] = df['classlabel'].map(class_mapping)
df


   color  size  price  classlabel
0  green     1   10.1           0
1    red     2   13.5           1
2   blue     3   15.3           0


inv_class_mapping = {v: k for k, v in class_mapping.items()}
df['classlabel'] = df['classlabel'].map(inv_class_mapping)
df


   color  size  price classlabel
0  green     1   10.1     class1
1    red     2   13.5     class2
2   blue     3   15.3     class1


from sklearn.preprocessing import LabelEncoder


class_le = LabelEncoder()
y = class_le.fit_transform(df['classlabel'].values)


y


array([0, 1, 0])


class_le.inverse_transform(y)


array(['class1', 'class2', 'class1'], dtype=object)


Performing one-hot encoding on nominal features 
X = df[['color', 'size', 'price']].values


color_le = LabelEncoder()
X[:, 0] = color_le.fit_transform(X[:, 0])
X


array([[1, 1, 10.1],
       [2, 2, 13.5],
       [0, 3, 15.3]], dtype=object)


Although color has now been converted to 0, 1, 2, these values cannot be used directly for modelling: a model would assume that 2 is greater than 1, i.e. that red is greater than green, which is not the case. We therefore need one-hot encoding with dummy variables, so that each label ends up represented as a vector. For example, a blue sample can be encoded as blue=1, green=0, red=0.
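The idea can be sketched in a few lines of plain Python before turning to a library encoder. The helper name `one_hot` is hypothetical, and ordering the categories alphabetically is an assumption made here so that the columns read blue, green, red.

```python
def one_hot(values):
    """Map each nominal value to a 0/1 vector with one position per category."""
    categories = sorted(set(values))   # e.g. ['blue', 'green', 'red']
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, encoded


categories, encoded = one_hot(["green", "red", "blue"])
```

Each row contains exactly one 1, so no ordering between the colors is implied: the blue sample becomes [1, 0, 0], green [0, 1, 0] and red [0, 0, 1].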


from sklearn.preprocessing import OneHotEncoder


# `categorical_features` was removed in later scikit-learn releases
ohe = OneHotEncoder(categorical_features=[0], sparse=False)


ohe.fit_transform(X)


array([[  0. ,   1. ,   0. ,   1. ,  10.1],
       [  0. ,   0. ,   1. ,   2. ,  13.5],
       [  1. ,   0. ,   0. ,   3. ,  15.3]])


pd.get_dummies(df[['price', 'color', 'size']])


   price  size  color_blue  color_green  color_red
0   10.1     1         0.0          1.0        0.0
1   13.5     2         0.0          0.0        1.0
2   15.3     3         1.0          0.0        0.0


Partitioning a dataset in training and test sets 


The test set can be understood as the ultimate test of our model before we let it loose on the real world.

# load the wine dataset
df_wine = pd.read_csv('data/wine.data', header=None)


df_wine.columns = ['Class label', 'Alcohol', 'Malic acid', 'Ash',
                   'Alcalinity of ash', 'Magnesium', 'Total phenols',
                   'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins',
                   'Color intensity', 'Hue', 'OD280/OD315 of diluted wines',
                   'Proline']
print('Class labels', np.unique(df_wine['Class label']))
df_wine.head()
# there are three class labels in total


('Class labels', array([1, 2, 3]))


   Class label  Alcohol  Malic acid   Ash  Alcalinity of ash  Magnesium  Total phenols
0            1    14.23        1.71  2.43               15.6        127           2.80
1            1    13.20        1.78  2.14               11.2        100           2.65
2            1    13.16        2.36  2.67               18.6        101           2.80
3            1    14.37        1.95  2.50               16.8        113           3.85
4            1    13.24        2.59  2.87               21.0        118           2.80


Use train_test_split to split the data into training and test sets:


# sklearn.cross_validation in older scikit-learn releases
from sklearn.model_selection import train_test_split
X, y = df_wine.iloc[:, 1:].values, df_wine.iloc[:, 0].values


X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=0.3, random_state=0)


stratified train test split 


A stratified split makes the resulting subsets better preserve the relative proportions of the labels.


def label_frequency(labels):
    counts = np.unique(labels, return_counts=True)[1]
    n = len(labels)
    return counts / float(n)


label_frequency(y)


array([ 0.33146067,  0.3988764 ,  0.26966292])


label_frequency(y_train), label_frequency(y_test)


(array([ 0.32258065,  0.39516129,  0.28225806]),
 array([ 0.35185185,  0.40740741,  0.24074074]))


X_train, X_test, y_train, y_test = \
    train_test_split(X, y, stratify=y, test_size=0.3, random_state=0)
label_frequency(y_train), label_frequency(y_test)


(array([ 0.33333333,  0.39837398,  0.26829268]),
 array([ 0.32727273,  0.4       ,  0.27272727]))


Bringing features onto the same scale 


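Feature scaling is an important step. Only with decision trees and random forests do we not need to worry about it; with many other algorithms and models the fit is better after scaling.


Two commonly used approaches: normalization and standardization.


• normalization: rescaling to [0, 1], e.g. min-max scaling:

$$x_{norm}^{(i)} = \frac{x^{(i)} - x_{min}}{x_{max} - x_{min}}$$

• standardization: often more practical, because in some algorithms the weights are initialized to 0 or close to 0, and standardization makes the weights easier to update. Standardization is also less sensitive to outliers and less affected by them:

$$x_{std}^{(i)} = \frac{x^{(i)} - \mu_x}{\sigma_x}$$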
from sklearn.preprocessing import MinMaxScaler


mms = MinMaxScaler() 
X_train_norm = mms.fit_transform(X_train) 
X_test_norm = mms.transform(X_test) 


from sklearn.preprocessing import StandardScaler 
stdsc = StandardScaler() 


X_train_std = stdsc.fit_transform(X_train) 
X_test_std = stdsc.transform(X_test) 
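Note that in both cases the scaler is fitted on the training data only and then reused on the test data. The following NumPy sketch shows why that matters; `SimpleStandardScaler` is a hypothetical stand-in written for illustration, not the scikit-learn class, and the tiny arrays are made up.

```python
import numpy as np

class SimpleStandardScaler:
    """Hypothetical minimal scaler: learns mean and std in fit()."""

    def fit(self, X):
        self.mean_ = X.mean(axis=0)
        self.scale_ = X.std(axis=0)   # population std (ddof=0)
        return self

    def transform(self, X):
        # Reuse the statistics learned in fit(); never recompute them here
        return (X - self.mean_) / self.scale_

X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test = np.array([[4.0, 40.0]])

scaler = SimpleStandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)
print(X_train_std.mean(axis=0), X_train_std.std(axis=0))
print(X_test_std)
```

The test point is expressed in units of the training distribution, so a model trained on `X_train_std` sees test inputs on the same scale. Fitting a second scaler on the test set would silently change that scale.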


A visual example: 


ex = pd.DataFrame([0, 1, 2 ,3, 4, 5]) 


ex[1] = (ex[0] - ex[0].mean()) / ex[0].std() 


ex[2] = (ex[0] - ex[0].min()) / (ex[0].max() - ex[0].min())
ex.columns = ['input', 'standardized', 'normalized']


ex 

input standardized normalized 
0 0 -1.336306 0.0 
1 1 -0.801784 0.2 
2 2 -0.267261 0.4 
3 3 0.267261 0.6 
4 4 0.801784 0.8 
5 5 1.336306 1.0 


Feature selection 


Often we collect many features that might be related to a supervised prediction
task, but we don't know which of them are actually predictive. To improve
interpretability, and sometimes also generalization performance, we can use
feature selection to select a subset of the original features.
Following John, Kohavi, and Pfleger (1994), feature selection methods can be divided into two classes:


• Wrapper methods evaluate multiple models using procedures that add and/or
remove predictors to find the optimal combination that maximizes model
performance. In essence, wrapper methods are search algorithms that treat
the predictors as the inputs and utilize model performance as the output to be
optimized.

• Filter methods evaluate the relevance of the predictors outside of the
predictive models and subsequently model only the predictors that pass some
criterion. For example, for classification problems, each predictor could be
individually evaluated to check if there is a plausible relationship between it
and the observed classes. Only predictors with important relationships would
then be included in a classification model. Saeys, Inza, and Larranaga (2007)
survey filter methods.


Both approaches have advantages and drawbacks. Filter methods are usually
more computationally efficient than wrapper methods, but the selection criterion is
not directly related to the effectiveness of the model. Also, most filter methods
evaluate each predictor separately; consequently, redundant (i.e. highly correlated)
predictors may be selected, and important interactions between variables cannot
be quantified. The downside of the wrapper method is that many models are
evaluated (which may also require parameter tuning), which increases computation
time. There is also an increased risk of overfitting with wrappers.


sklearn mainly uses filter methods. The following shows how to do feature selection with sklearn.


Univariate statistics 
The simplest method to select features is using univariate statistics, that is, by
looking at each feature individually and running a statistical test to see whether it
is related to the target.
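As a rough illustration of what such a per-feature test computes, the sketch below evaluates a one-way ANOVA F statistic for every column independently. The helper `anova_f_scores` and the synthetic data are assumptions for this example; scikit-learn's own scoring functions additionally return p-values.

```python
import numpy as np

def anova_f_scores(X, y):
    """One-way ANOVA F statistic of each feature against the class labels."""
    classes = np.unique(y)
    n, k = X.shape[0], len(classes)
    grand_mean = X.mean(axis=0)
    ss_between = np.zeros(X.shape[1])
    ss_within = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        ss_between += len(Xc) * (Xc.mean(axis=0) - grand_mean) ** 2
        ss_within += ((Xc - Xc.mean(axis=0)) ** 2).sum(axis=0)
    # Between-class variance divided by within-class variance
    return (ss_between / (k - 1)) / (ss_within / (n - k))

rng = np.random.RandomState(0)
y = np.repeat([0, 1, 2], 30)
noise = rng.randn(90)                    # unrelated to the label
informative = y + 0.1 * rng.randn(90)    # tracks the label closely
X = np.column_stack([noise, informative])

scores = anova_f_scores(X, y)
best = int(np.argmax(scores))
print(scores, best)
```

Each feature is scored without looking at the others, which is what makes the approach fast and also why it cannot detect redundant features or interactions.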


The univariate statistics available in sklearn include:


• for regression: f_regression
• for classification: chi2 or f_classif


Once the test statistics and p-values have been computed, sklearn offers several matching selection strategies:


• SelectKBest removes all but the k highest scoring features

• SelectPercentile removes all but a user-specified highest scoring percentage
of features

• using common univariate statistical tests for each feature: false positive rate
SelectFpr, false discovery rate SelectFdr, or family wise error SelectFwe.

• GenericUnivariateSelect performs univariate feature selection with a
configurable strategy. This makes it possible to select the best univariate selection
strategy with a hyper-parameter search estimator.


# Using chi2 and SelectKBest as an example
from sklearn.feature_selection import chi2
from sklearn.feature_selection import SelectKBest


select = SelectKBest(chi2, k=6)
X_uni_selected = select.fit_transform(X_train, y_train)


print(X_train.shape)
print(X_uni_selected.shape)


(123, 13) 
(123, 6) 


import matplotlib.pyplot as plt 
%matplotlib inline 


mask = select.get_support()
print(mask) 


plt.matshow(mask.reshape(1, -1), cmap='gray_r'); 


[False  True False  True  True False  True False False  True False False
  True]





Recursive feature elimination 
Given an external estimator that assigns weights to features (e.g., the coefficients
of a linear model), recursive feature elimination (RFE) selects features by
recursively considering smaller and smaller sets of features. First, the estimator is
trained on the initial set of features and weights are assigned to each one of them.
Then, features whose absolute weights are the smallest are pruned from the
current set of features. That procedure is recursively repeated on the pruned set until
the desired number of features to select is eventually reached.
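The loop described above can be sketched in a few lines of NumPy. This version uses an ordinary least-squares fit as the external estimator and drops one feature per round; the function name `rfe_linear` and the synthetic data are assumptions for illustration, not the scikit-learn implementation.

```python
import numpy as np

def rfe_linear(X, y, n_features_to_select):
    """Recursive feature elimination with least-squares weights."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_features_to_select:
        # Train the estimator on the current feature set
        coef, *_ = np.linalg.lstsq(X[:, remaining], y, rcond=None)
        # Prune the feature with the smallest absolute weight
        weakest = int(np.argmin(np.abs(coef)))
        del remaining[weakest]
    return remaining

rng = np.random.RandomState(0)
X = rng.randn(100, 5)
y = 3.0 * X[:, 0] - 2.0 * X[:, 2]    # only features 0 and 2 matter

selected = rfe_linear(X, y, n_features_to_select=2)
print(selected)  # → [0, 2]
```

The model is refitted after every elimination, so the weights used to rank the surviving features always reflect the current subset rather than the original full model.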


from sklearn.feature_selection import RFE 
from sklearn.svm import SVC 


svc = SVC(kernel="linear", C=1)


rfe = RFE(estimator=svc,
          n_features_to_select=6,
          step=1)
rfe.fit(X_train_std, y_train)


X_rfe_selected = rfe.transform(X_train_std) 


mask = rfe.get_support() 
print(mask) 
plt.matshow(mask.reshape(1, -1), cmap='gray_r'); 


[ True False False  True False False  True False False False  True  True
  True]
Feature selection using SelectFromModel

SelectFromModel is a meta-transformer that can be used along with any estimator
that has a coef_ or feature_importances_ attribute after fitting. Features are
considered unimportant and removed if the corresponding coef_ or
feature_importances_ values are below the provided threshold parameter. Apart
from specifying the threshold numerically, there are built-in heuristics for finding a
threshold using a string argument. Available heuristics are "mean", "median" and
float multiples of these like "0.1*mean".


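Some models can rank features by importance. For example, in a linear model with an L1 penalty the coefficients of unimportant features are shrunk to 0, and a random forest can compute an importance score for each feature.
sklearn's SelectFromModel can be combined with such models to perform feature selection.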
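The thresholding step itself is simple. The sketch below mimics the "mean" and "median" heuristics on a made-up importance vector; `select_by_threshold` is a hypothetical helper written for this example, and it assumes that features whose importance is greater than or equal to the cutoff are kept.

```python
import numpy as np

def select_by_threshold(importances, threshold):
    """Boolean mask of features whose importance reaches the cutoff."""
    importances = np.asarray(importances, dtype=float)
    if threshold == "mean":
        cutoff = importances.mean()
    elif threshold == "median":
        cutoff = np.median(importances)
    else:
        cutoff = float(threshold)      # numeric threshold
    return importances >= cutoff

# Hypothetical importance scores that sum to 1
importances = [0.5, 0.125, 0.25, 0.0625, 0.0625]
mask_mean = select_by_threshold(importances, "mean")
mask_median = select_by_threshold(importances, "median")
print(mask_mean)
print(mask_median)
```

With these scores the mean is 0.2 and the median is 0.125, so the "mean" heuristic keeps two features and the "median" heuristic keeps three.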


L1-based feature selection 


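• L2 norm: $\|w\|_2^2 = \sum_{j=1}^{m} w_j^2$
• L1 norm: $\|w\|_1 = \sum_{j=1}^{m} |w_j|$


◦ Compared with L2 regularization, L1 regularization drives more coefficients to exactly 0.
◦ If a high-dimensional dataset contains many irrelevant features, L1 regularization can be used as a form of feature selection.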
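One way to see why the L1 penalty produces exact zeros is the special case of an orthonormal design matrix. Under that simplifying assumption (introduced here for illustration), the L1-penalized solution is the least-squares solution passed through a soft-thresholding step, while the L2-penalized solution is just a rescaled copy of it. The function names and numbers below are hypothetical.

```python
import numpy as np

def l1_shrink(w, lam):
    """Soft-thresholding: shrink toward 0 and clip small weights to 0."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

def l2_shrink(w, lam):
    """Ridge-style shrinkage: rescale every weight, none becomes 0."""
    return w / (1.0 + lam)

w_ols = np.array([3.0, -0.5, 0.2, -2.0])   # hypothetical unpenalized weights
w_l1 = l1_shrink(w_ols, lam=1.0)
w_l2 = l2_shrink(w_ols, lam=1.0)
print(w_l1)
print(w_l2)
```

Weights whose magnitude is below the penalty strength vanish under L1, whereas L2 keeps every weight nonzero and only makes it smaller.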
from sklearn.linear_model import LogisticRegression


# solver='liblinear' supports the L1 penalty (it was the default solver in older scikit-learn releases)
lr = LogisticRegression(penalty='l1', C=0.1, solver='liblinear')
lr.fit(X_train_std, y_train)

print('Training accuracy:', lr.score(X_train_std, y_train))
print('Test accuracy:', lr.score(X_test_std, y_test))


('Training accuracy:', 0.98373983739837401) 
('Test accuracy:', 0.96363636363636362) 


With the L1 penalty, training and test accuracy are close, which suggests the model is not overfitting.


lr.intercept_ 


array([-0.26943618, -0.12656436, -0.79402866]) 


lr.coef_


(Output: a 3 x 13 array of coefficients, one row per class, in which most entries are exactly 0.)


The weight vectors are sparse (only a few coefficients are nonzero).
# Weight coefficients of the different features for different regularization strengths

import matplotlib.pyplot as plt

%matplotlib inline


fig = plt.figure()
ax = plt.subplot(111)


colors = ['blue', 'green', 'red', 'cyan',
          'magenta', 'yellow', 'black',
          'pink', 'lightgreen', 'lightblue',
          'gray', 'indigo', 'orange']


weights, params = [], []
for c in np.arange(-4, 6):
    # 10.0**c avoids raising an integer to a negative integer power
    lr = LogisticRegression(penalty='l1', C=10.0**c, random_state=0,
                            solver='liblinear')
    lr.fit(X_train_std, y_train)
    weights.append(lr.coef_[1])
    params.append(10.0**c)


weights = np.array(weights)


for column, color in zip(range(weights.shape[1]), colors):
    plt.plot(params, weights[:, column],
             label=df_wine.columns[column+1],
             color=color)
plt.axhline(0, color='black', linestyle='--', linewidth=3)
plt.xlim([10**(-5), 10**5])
plt.ylabel('weight coefficient')
plt.xlabel('C')
plt.xscale('log')
plt.legend(loc='upper left')
ax.legend(loc='upper center',
          bbox_to_anchor=(1.38, 1.03),
          ncol=1, fancybox=True);


# plt.savefig('./figures/l1_path.png', dpi=300)
As the strength of the L1 penalty increases, irrelevant features are driven out of the model (their coefficients become 0), so L1 regularization can serve as a feature selection method.


Combining it with sklearn's SelectFromModel:
from sklearn.feature_selection import SelectFromModel


model_l1 = SelectFromModel(lr, threshold='median', prefit=True)
X_l1_selected = model_l1.transform(X)


mask = model_l1.get_support()
print(mask)
plt.matshow(mask.reshape(1, -1), cmap='gray_r');


[ True False  True  True False False  True False False  True  True False
  True]





Tree-based feature selection 
A random forest can measure the importance of each feature, so it can also be used as a feature selection tool.
from sklearn.ensemble import RandomForestClassifier


feat_labels = df_wine.columns[1:]


forest = RandomForestClassifier(n_estimators=1000,
                                random_state=0,
                                n_jobs=-1)
forest.fit(X_train, y_train)


importances = forest.feature_importances_

# Feature indices sorted from most to least important
indices = np.argsort(importances)[::-1]

for i, idx in enumerate(indices):
    print("%2d) %-*s %f" % (i + 1, 30,
                            feat_labels[idx],
                            importances[idx]))


(Output: the 13 features listed in decreasing order of importance, numbered 1) to 13).)


plt.title('Feature Importances')
plt.bar(range(X_train.shape[1]),
        importances[indices],
        color='lightblue',
        align='center')

plt.xticks(range(X_train.shape[1]),
           feat_labels[indices], rotation=90)
plt.xlim([-1, X_train.shape[1]])
plt.tight_layout()


[Figure: Feature Importances (bar chart)]
Feature selection with sklearn's SelectFromModel


from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier


select_rf = SelectFromModel(forest, threshold=0.1, prefit=True)


# Alternatively, train a new model (no prefit here, because fit() is called afterwards)
# select = SelectFromModel(RandomForestClassifier(n_estimators=10000, random_state=0, n_jobs=-1), threshold=0.15)
# select.fit(X_train, y_train)


X_train_rf = select_rf.transform(X_train)


print(X_train.shape[1])     # number of original features
print(X_train_rf.shape[1])  # number of features after selection


13 


# Show the selected features
mask = select_rf.get_support()
for f in feat_labels[mask]:
    print(f)

Flavanoids
Color intensity
OD280/OD315 of diluted wines

# Visualize the selection: black cells are selected features, white cells are filtered out
mask = select_rf.get_support()
print(mask)
plt.matshow(mask.reshape(1, -1), cmap='gray_r');


[ True False False False False False  True False False  True False  True
  True]
Random forests can also be combined with sequential selection:


from sklearn.feature_selection import RFE
select = RFE(RandomForestClassifier(n_estimators=100, random_state=0),
             n_features_to_select=3)


select.fit(X_train, y_train)


mask = select.get_support()
plt.matshow(mask.reshape(1, -1), cmap='gray_r');
Feature extraction


The previous section covered feature selection. This section covers another approach to dimensionality reduction: feature extraction.
Unsupervised dimensionality reduction via principal component analysis


• improve computational efficiency

• help to reduce the curse of dimensionality

• unsupervised linear transformation technique

• identify patterns in data based on the correlation between features

• PCA aims to find the directions of maximum variance in high-dimensional
data and projects the data onto a new subspace with equal or fewer dimensions than
the original one.


Summary of the PCA algorithm:


1. Standardize the d-dimensional dataset.

2. Construct the covariance matrix.

3. Decompose the covariance matrix into its eigenvectors and eigenvalues.

4. Select k eigenvectors that correspond to the k largest eigenvalues, where k is
the dimensionality of the new feature subspace (k ≤ d).

5. Construct a projection matrix W from the "top" k eigenvectors.

6. Transform the d-dimensional input dataset X using the projection matrix W to
obtain the new k-dimensional feature subspace.


In short, PCA looks for the directions of maximum variance.
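The six steps can be condensed into one small NumPy function. This is a sketch under stated assumptions: `pca_project` is a name introduced here, the data is synthetic, and `np.linalg.eigh` is used because a covariance matrix is symmetric.

```python
import numpy as np

def pca_project(X, k):
    """Project X onto its top-k principal components."""
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)    # 1. standardize
    cov = np.cov(X_std.T)                           # 2. covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)          # 3. eigendecomposition
    order = np.argsort(eigvals)[::-1][:k]           # 4. k largest eigenvalues
    W = eigvecs[:, order]                           # 5. projection matrix
    return X_std.dot(W), W, eigvals[order]          # 6. transform

rng = np.random.RandomState(0)
X = rng.randn(200, 3)
X[:, 1] = 2.0 * X[:, 0] + 0.1 * rng.randn(200)      # two correlated columns

X_pca, W, top_vals = pca_project(X, k=2)
print(X_pca.shape)
print(top_vals)
```

The sample variance of each projected column equals the corresponding eigenvalue, which is why sorting by eigenvalue is the same as sorting directions by the variance they capture.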
We again use the Wine dataset.


import pandas as pd
df_wine = pd.read_csv('data/wine.data', header=None)


df_wine.columns = ['Class label', 'Alcohol', 'Malic acid', 'Ash',
                   'Alcalinity of ash', 'Magnesium', 'Total phenols',
                   'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins',
                   'Color intensity', 'Hue', 'OD280/OD315 of diluted wines',
                   'Proline']


df_wine.head()


   Class label  Alcohol  Malic acid   Ash  Alcalinity of ash  Magnesium  Total phenols
0            1    14.23        1.71  2.43               15.6        127           2.80
1            1    13.20        1.78  2.14               11.2        100           2.65
2            1    13.16        2.36  2.07               18.6        101           2.80
3            1    14.37        1.95  2.50               16.8        113           3.85
4            1    13.24        2.59  2.87               21.0        118           2.80


Splitting the data into 70% training and 30% test subsets. 


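# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection provides the same function
from sklearn.model_selection import train_test_split
X, y = df_wine.iloc[:, 1:].values, df_wine.iloc[:, 0].values


X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=0.3, random_state=0)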


Standardizing the data. 


from sklearn.preprocessing import StandardScaler 


sc = StandardScaler() 
X_train_std = sc.fit_transform(X_train) 
X_test_std = sc.transform(X_test) 


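Compute the covariance matrix: $\sigma_{jk} = \frac{1}{n} \sum_{i=1}^{n} \left(x_j^{(i)} - \mu_j\right)\left(x_k^{(i)} - \mu_k\right)$
Its eigenvalues $\lambda$ and eigenvectors $v$ satisfy $\Sigma v = \lambda v$. (np.cov below divides by $n - 1$ instead of $n$, which only rescales the eigenvalues.)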


import numpy as np 


cov_mat = np.cov(X_train_std.T) 


eigen_vals, eigen_vecs = np.linalg.eig(cov_mat)


print('Eigenvalues \n%s' % eigen_vals)


Eigenvalues
[ 4.8923083   2.46635032  1.42809973  1.01233462  0.84906459  0.60181514
  0.52251546  0.08414846  0.33051429  0.29595018  0.16831254  0.21432212
  0.2399553 ]


Total and explained variance 


The explained variance ratio of an eigenvalue is simply the ratio of that
eigenvalue to the total sum of the eigenvalues:
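As a worked instance of this ratio, take the eigenvalues printed above: they sum to about 13.1057, so the largest eigenvalue accounts for roughly 37% of the total variance. The index $j$ and the dimension $d$ are notation introduced here for illustration.

```latex
\[
\text{explained variance ratio}_j = \frac{\lambda_j}{\sum_{i=1}^{d} \lambda_i},
\qquad
\frac{\lambda_1}{\sum_{i=1}^{13} \lambda_i} \approx \frac{4.8923}{13.1057} \approx 0.373
\]
```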
tot = sum(eigen_vals)
var_exp = [(i / tot) for i in sorted(eigen_vals, reverse=True)]
cum_var_exp = np.cumsum(var_exp)


# plot variance
import matplotlib.pyplot as plt
%matplotlib inline


plt.bar(range(1, 14), var_exp, alpha=0.5, align='center',
        label='individual explained variance')
plt.step(range(1, 14), cum_var_exp, where='mid',
         label='cumulative explained variance')
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal components')
plt.legend(loc='best')
plt.tight_layout()
# plt.savefig('./figures/pca1.png', dpi=300)
[Figure: individual and cumulative explained variance ratio versus principal component index]


The first component explains nearly 40% of the variance, and the first two components together explain close to 60%.


Feature transformation 


eigen_pairs = [(np.abs(eigen_vals[i]), eigen_vecs[:, i]) for i in range(len(eigen_vals))]


# Sort by eigenvalue only, from largest to smallest
eigen_pairs.sort(key=lambda pair: pair[0], reverse=True)


We chose only two eigenvectors for the purpose of illustration, since we are going
to plot the data via a two-dimensional scatter plot later in this subsection.


In practice, the number of principal components has to be determined from a 
trade-off between computational efficiency and the performance of the classifier. 


w = np.column_stack([eigen_pairs[0][1], eigen_pairs[1][1]]) 
print(w)


(Output: the 13 x 2 projection matrix W, whose columns are the two leading eigenvectors.)


Using the projection matrix W, we can obtain the transformed data: $X' = XW$


X_train_pca = X_train_std.dot(w)
colors = ['r', 'b', 'g']
markers = ['s', 'x', 'o']


for l, c, m in zip(np.unique(y_train), colors, markers):
    plt.scatter(X_train_pca[y_train == l, 0],
                X_train_pca[y_train == l, 1],
                c=c, label=l, marker=m)


plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.legend(loc='lower left')
plt.tight_layout()





The data is spread out more along the x-axis (the first principal component) than along the y-axis,
and a linear classifier will likely be able to separate the classes well.


Principal component analysis in scikit-learn 




Explained variance ratio 


from sklearn.decomposition import PCA 


pca = PCA() 
X_train_pca = pca.fit_transform(X_train_std) 
pca.explained_variance_ratio_ 


array([ 0.37329648,  0.18818926,  0.10896791,  0.07724389,  0.06478595,
        0.04592014,  0.03986936,  0.02521914,  0.02258181,  0.01830924,
        0.01635336,  0.01284271,  0.00642076])


plt.bar(range(1, 14), pca.explained_variance_ratio_, alpha=0.5,
        align='center')
plt.step(range(1, 14), np.cumsum(pca.explained_variance_ratio_),
         where='mid')
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal components');
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_std)
X_test_pca = pca.transform(X_test_std)


plt.scatter(X_train_pca[:, 0], X_train_pca[:, 1])
plt.xlabel('PC 1')
plt.ylabel('PC 2');





If we compare the PCA projection via scikit-learn with our own PCA
implementation, we notice that the plot above is a mirror image of the previous
PCA via our step-by-step approach.
Note that this is not due to an error in either of those two implementations; the
reason for this difference is that, depending on the eigensolver, eigenvectors can
have either negative or positive signs.
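The sign ambiguity is easy to verify numerically. In the sketch below (the 2 x 2 matrix and the sample point are made up for illustration), both an eigenvector and its negation satisfy the eigenvalue equation, and projecting onto one or the other only flips the sign of the coordinate.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])                 # small symmetric example matrix
eigvals, eigvecs = np.linalg.eigh(A)       # eigenvalues in ascending order
lam = eigvals[-1]                          # largest eigenvalue
v = eigvecs[:, -1]                         # its eigenvector

# Both v and -v satisfy A v = lambda v
ok_pos = np.allclose(A.dot(v), lam * v)
ok_neg = np.allclose(A.dot(-v), lam * (-v))

x = np.array([1.0, 0.5])                   # hypothetical sample
proj_pos, proj_neg = x.dot(v), x.dot(-v)   # mirror-image coordinates
print(ok_pos, ok_neg, proj_pos, proj_neg)
```

Which of the two signs a solver returns is an implementation detail, so two correct PCA implementations can produce plots that are mirror images of each other.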


from matplotlib.colors import ListedColormap


def plot_decision_regions(X, y, classifier, resolution=0.02):

    # set up the marker generator and color map
    markers = ('s', 'x', 'o', '^', 'v')
    colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
    cmap = ListedColormap(colors[:len(np.unique(y))])

    # plot the decision surface on a grid covering both features
    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution),
                           np.arange(x2_min, x2_max, resolution))
    Z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
    Z = Z.reshape(xx1.shape)
    plt.contourf(xx1, xx2, Z, alpha=0.4, cmap=cmap)
    plt.xlim(xx1.min(), xx1.max())
    plt.ylim(xx2.min(), xx2.max())

    # plot the samples of each class
    for idx, cl in enumerate(np.unique(y)):
        plt.scatter(x=X[y == cl, 0], y=X[y == cl, 1],
                    alpha=0.8, color=cmap(idx),
                    marker=markers[idx], label=cl)


Training logistic regression classifier using the first 2 principal components. 


from sklearn.linear_model import LogisticRegression


lr = LogisticRegression()
lr = lr.fit(X_train_pca, y_train)


plot_decision_regions(X_train_pca, y_train, classifier=lr)
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.legend(loc='lower left')
plt.tight_layout();
# plt.savefig('./figures/pca3.png', dpi=300)





# Evaluate on the test set
plot_decision_regions(X_test_pca, y_test, classifier=lr)
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.legend(loc='lower left');

# The classification result on the test set is also quite good
Using kernel principal component analysis for nonlinear mappings


Via kernel PCA, we perform a nonlinear mapping that transforms the data onto a
higher-dimensional space, and then use standard PCA in this higher-dimensional space
to project the data back onto a lower-dimensional space where the samples can
be separated by a linear classifier.


Most commonly used kernels:


• polynomial kernel
• hyperbolic tangent (sigmoid) kernel
• Radial Basis Function (RBF)


To implement RBF kernel PCA:


1. Compute the kernel (similarity) matrix K.

2. Center the kernel matrix K.

3. Collect the top k eigenvectors of the centered kernel matrix based on their
corresponding eigenvalues, ranked by decreasing magnitude.


Implementing a kernel principal component 
analysis in Python 


Radial Basis Function (RBF) or Gaussian kernel:
\begin{align}
k(x^{(i)}, x^{(j)}) &= \exp\left(-\frac{\|x^{(i)} - x^{(j)}\|^2}{2\sigma^2}\right) \\
&= \exp\left(-\gamma \|x^{(i)} - x^{(j)}\|^2\right)
\end{align}
where $\gamma = \frac{1}{2\sigma^2}$.


from scipy.spatial.distance import pdist, squareform
from scipy.linalg import eigh
import numpy as np


def rbf_kernel_pca(X, gamma, n_components):
    """
    RBF kernel PCA implementation.

    Parameters
    ------------
    X: {NumPy ndarray}, shape = [n_samples, n_features]

    gamma: float
      Tuning parameter of the RBF kernel

    n_components: int
      Number of principal components to return

    Returns
    ------------
    X_pc: {NumPy ndarray}, shape = [n_samples, k_features]
      Projected dataset

    """
    # Calculate pairwise squared Euclidean distances
    # in the MxN dimensional dataset.
    sq_dists = pdist(X, 'sqeuclidean')

    # Convert pairwise distances into a square matrix.
    mat_sq_dists = squareform(sq_dists)

    # Compute the symmetric kernel matrix.
    # np.exp replaces scipy.exp, which newer SciPy releases no longer provide
    K = np.exp(-gamma * mat_sq_dists)

    # Center the kernel matrix.
    N = K.shape[0]
    one_n = np.ones((N, N)) / N
    K = K - one_n.dot(K) - K.dot(one_n) + one_n.dot(K).dot(one_n)

    # eigh returns the eigenvalues in ascending order
    eigvals, eigvecs = eigh(K)

    # Collect the top k eigenvectors (projected samples)
    X_pc = np.column_stack([eigvecs[:, -i]
                            for i in range(1, n_components + 1)])

    return X_pc


Example 1: Separating half-moon shapes 


Build half-moon-shaped data for the demonstration:
import matplotlib.pyplot as plt


%matplotlib inline
from sklearn.datasets import make_moons


X, y = make_moons(n_samples=100, random_state=123)
plt.scatter(X[y==0, 0], X[y==0, 1], color='red', marker='^', alpha=0.5)
plt.scatter(X[y==1, 0], X[y==1, 1], color='blue', marker='o', alpha=0.5)


plt.tight_layout()
# standard PCA
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


scaler = StandardScaler()
X_std = scaler.fit_transform(X)


scikit_pca = PCA(n_components=2)
X_spca = scikit_pca.fit_transform(X_std)


fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(7, 3))


ax[0].scatter(X_spca[y==0, 0], X_spca[y==0, 1],
              color='red', marker='^', alpha=0.5)
ax[0].scatter(X_spca[y==1, 0], X_spca[y==1, 1],
              color='blue', marker='o', alpha=0.5)


ax[1].scatter(X_spca[y==0, 0], np.zeros((50, 1)),
              color='red', marker='^', alpha=0.5)
ax[1].scatter(X_spca[y==1, 0], np.zeros((50, 1)),
              color='blue', marker='o', alpha=0.5)


ax[0].set_xlabel('PC1')
ax[0].set_ylabel('PC2')
ax[1].set_ylim([-1, 1])
ax[1].set_yticks([])
ax[1].set_xlabel('PC1')


plt.tight_layout()
# plt.savefig('./figures/half_moon_2.png', dpi=300)





A linear classifier would not be able to perform well on this projection.


# kernel PCA function rbf_kernel_pca
from matplotlib.ticker import FormatStrFormatter


X_kpca = rbf_kernel_pca(X, gamma=15, n_components=2)


fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(7, 3))

ax[0].scatter(X_kpca[y==0, 0], X_kpca[y==0, 1],
              color='red', marker='^', alpha=0.5)
ax[0].scatter(X_kpca[y==1, 0], X_kpca[y==1, 1],
              color='blue', marker='o', alpha=0.5)


ax[1].scatter(X_kpca[y==0, 0], np.zeros((50, 1)),
              color='red', marker='^', alpha=0.5)
ax[1].scatter(X_kpca[y==1, 0], np.zeros((50, 1)),
              color='blue', marker='o', alpha=0.5)


ax[0].set_xlabel('PC1')
ax[0].set_ylabel('PC2')
ax[1].set_ylim([-1, 1])
ax[1].set_yticks([])
ax[1].set_xlabel('PC1')
ax[0].xaxis.set_major_formatter(FormatStrFormatter('%0.1f'))
ax[1].xaxis.set_major_formatter(FormatStrFormatter('%0.1f'))


plt.tight_layout()
# plt.savefig('./figures/half_moon_3.png', dpi=300)
The two classes (circles and triangles) are now linearly well separated.


Example 2: Separating concentric circles 


from sklearn.datasets import make_circles


X, y = make_circles(n_samples=1000, random_state=123, noise=0.1,
                    factor=0.2)


plt.scatter(X[y==0, 0], X[y==0, 1], color='red', marker='^', alpha=0.5)
plt.scatter(X[y==1, 0], X[y==1, 1], color='blue', marker='o', alpha=0.5)


plt.tight_layout()
# plt.savefig('./figures/circles_1.png', dpi=300)
# standard PCA
scaler = StandardScaler()
X_std = scaler.fit_transform(X)


scikit_pca = PCA(n_components=2)
X_spca = scikit_pca.fit_transform(X_std)


fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(7, 3))


ax[0].scatter(X_spca[y==0, 0], X_spca[y==0, 1],
              color='red', marker='^', alpha=0.5)
ax[0].scatter(X_spca[y==1, 0], X_spca[y==1, 1],
              color='blue', marker='o', alpha=0.5)


ax[1].scatter(X_spca[y==0, 0], np.zeros((500, 1)),
              color='red', marker='^', alpha=0.5)
ax[1].scatter(X_spca[y==1, 0], np.zeros((500, 1)),
              color='blue', marker='o', alpha=0.5)


ax[0].set_xlabel('PC1')
ax[0].set_ylabel('PC2')
ax[1].set_ylim([-1, 1])
ax[1].set_yticks([])
ax[1].set_xlabel('PC1')


plt.tight_layout()
# plt.savefig('./figures/circles_2.png', dpi=300)





# kernel RBF
X_kpca = rbf_kernel_pca(X, gamma=15, n_components=2)


fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(7, 3))

ax[0].scatter(X_kpca[y==0, 0], X_kpca[y==0, 1],
              color='red', marker='^', alpha=0.5)
ax[0].scatter(X_kpca[y==1, 0], X_kpca[y==1, 1],
              color='blue', marker='o', alpha=0.5)


ax[1].scatter(X_kpca[y==0, 0], np.zeros((500, 1)),
              color='red', marker='^', alpha=0.5)
ax[1].scatter(X_kpca[y==1, 0], np.zeros((500, 1)),
              color='blue', marker='o', alpha=0.5)


ax[0].set_xlabel('PC1')
ax[0].set_ylabel('PC2')
ax[1].set_ylim([-1, 1])
ax[1].set_yticks([])
ax[1].set_xlabel('PC1')


plt.tight_layout()
# plt.savefig('./figures/circles_3.png', dpi=300)
Projecting new data points


Next, we look at how to project data points that were not part of the training dataset.
from scipy.spatial.distance import pdist, squareform
from scipy.linalg import eigh
import numpy as np


def rbf_kernel_pca(X, gamma, n_components):
    """
    RBF kernel PCA implementation.

    Parameters
    ------------
    X: {NumPy ndarray}, shape = [n_samples, n_features]

    gamma: float
      Tuning parameter of the RBF kernel

    n_components: int
      Number of principal components to return

    Returns
    ------------
    X_pc: {NumPy ndarray}, shape = [n_samples, k_features]
      Projected dataset

    lambdas: list
      Eigenvalues

    """
    # Calculate pairwise squared Euclidean distances
    # in the MxN dimensional dataset.
    sq_dists = pdist(X, 'sqeuclidean')

    # Convert pairwise distances into a square matrix.
    mat_sq_dists = squareform(sq_dists)

    # Compute the symmetric kernel matrix.
    # np.exp replaces scipy.exp, which newer SciPy releases no longer provide
    K = np.exp(-gamma * mat_sq_dists)

    # Center the kernel matrix.
    N = K.shape[0]
    one_n = np.ones((N, N)) / N
    K = K - one_n.dot(K) - K.dot(one_n) + one_n.dot(K).dot(one_n)

    # Obtain eigenpairs from the centered kernel matrix;
    # scipy.linalg.eigh returns them in ascending order
    eigvals, eigvecs = eigh(K)

    # Collect the top k eigenvectors (projected samples)
    alphas = np.column_stack([eigvecs[:, -i]
                              for i in range(1, n_components + 1)])

    # Collect the corresponding eigenvalues
    lambdas = [eigvals[-i] for i in range(1, n_components + 1)]

    return alphas, lambdas


X, y = make_moons(n_samples=100, random_state=123)
alphas, lambdas = rbf_kernel_pca(X, gamma=15, n_components=1)


x_new = X[25]
x_new


array([ 1.8713,  0.0093])


x_proj = alphas[25]  # original projection
x_proj


array([ 0.0788])


def project_x(x_new, X, gamma, alphas, lambdas):
    # Squared distances from the new point to every training sample
    pair_dist = np.array([np.sum((x_new - row)**2) for row in X])
    k = np.exp(-gamma * pair_dist)
    return k.dot(alphas / lambdas)


# projection of the "new" datapoint
x_reproj = project_x(x_new, X, gamma=15, alphas=alphas, lambdas=lambdas)
x_reproj


array([ 0.0788])


plt.scatter(alphas[y==0, 0], np.zeros((50)),
            color='red', marker='^', alpha=0.5)
plt.scatter(alphas[y==1, 0], np.zeros((50)),
            color='blue', marker='o', alpha=0.5)
plt.scatter(x_proj, 0, color='black',
            label='original projection of point X[25]', marker='^', s=100)
plt.scatter(x_reproj, 0, color='green',
            label='remapped point X[25]', marker='x', s=500)
plt.legend(scatterpoints=1)


plt.tight_layout()
# plt.savefig('./figures/reproject.png', dpi=300)
Kernel principal component analysis in scikit-learn


from sklearn.decomposition import KernelPCA


X, y = make_moons(n_samples=100, random_state=123)
scikit_kpca = KernelPCA(n_components=2, kernel='rbf', gamma=15)
X_skernpca = scikit_kpca.fit_transform(X)


plt.scatter(X_skernpca[y==0, 0], X_skernpca[y==0, 1],
            color='red', marker='^', alpha=0.5)
plt.scatter(X_skernpca[y==1, 0], X_skernpca[y==1, 1],
            color='blue', marker='o', alpha=0.5)


plt.xlabel('PC1')
plt.ylabel('PC2')
plt.tight_layout()
# plt.savefig('./figures/...', dpi=300)
Feature engineering checklist


1. Do you have domain knowledge? If yes, construct a better set of ad hoc
features.

2. Are your features commensurate? If no, consider normalizing them.

3. Do you suspect interdependence of features? If yes, expand your feature set
by constructing conjunctive features or products of features, as much as your
computer resources allow you.

4. Do you need to prune the input variables (e.g. for cost, speed or data
understanding reasons)? If no, construct disjunctive features or weighted
sums of features.

5. Do you need to assess features individually (e.g. to understand their influence
on the system or because their number is so large that you need to do a first
filtering)? If yes, use a variable ranking method; else, do it anyway to get
baseline results.

6. Do you need a predictor? If no, stop.

7. Do you suspect your data is "dirty" (has a few meaningless input patterns
and/or noisy outputs or wrong class labels)? If yes, detect the outlier
examples using the top ranking variables obtained in step 5 as
representation; check and/or discard them.

8. Do you know what to try first? If no, use a linear predictor. Use a forward
selection method with the "probe" method as a stopping criterion or use the
0-norm embedded method. For comparison, following the ranking of step 5,
construct a sequence of predictors of the same nature using increasing subsets
of features. Can you match or improve performance with a smaller subset? If
yes, try a non-linear predictor with that subset.

9. Do you have new ideas, time, computational resources, and enough
examples? If yes, compare several feature selection methods, including your
new idea, correlation coefficients, backward selection and embedded
methods. Use linear and non-linear predictors. Select the best approach with
model selection.

10. Do you want a stable solution (to improve performance and/or
understanding)? If yes, subsample your data and redo your analysis for
several "bootstraps".


An example of variable selection with regularization


[Figure: accuracy versus model complexity, showing the underfitting region, the sweet spot of best generalization, and the overfitting region]


from sklearn import datasets
from sklearn import model_selection  # replaces sklearn.cross_validation, removed in scikit-learn 0.20
from sklearn import linear_model
from sklearn import metrics
from sklearn import tree
from sklearn import neighbors
from sklearn import svm
from sklearn import ensemble
from sklearn import cluster


import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import seaborn as sns


np.random.seed(123)


# 50 samples with 50 features, of which only 10 are informative
X_all, y_all = datasets.make_regression(n_samples=50, n_features=50,
                                        n_informative=10)


X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X_all, y_all, train_size=0.5)


X_train.shape, y_train.shape
((25, 50), (25,))
X_test.shape, y_test.shape


((25, 50), (25,))


Linear Regression


$\min_{w, b} \sum_i \| w^T x_i + b - y_i \|^2$


model = linear_model.LinearRegression( ) 


model.fit(X_train, y_train) 


/Users/alan/anaconda/lib/python2.7/site-packages/scipy/linalg/basic.py:884: RuntimeWarning: internal gelsd driver lwork query error, required iwork dimension not returned. This is likely the result of LAPACK bug 0038, fixed in LAPACK 3.2.2 (released July 21, 2010). Falling back to 'gelss' driver.
  warnings.warn(mesg, RuntimeWarning)


LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)


def sse(resid):
    return sum(resid**2)


resid_train = y_train - model.predict(X_train)
sse_train = sse(resid_train)
sse_train


7.9634561748974877e-25


resid_test = y_test - model.predict(X_test)
sse_test = sse(resid_test)
sse_test


213555.61203039085


The SSE on the test data shows that the predictions are very poor, which suggests overfitting.


model.score(X_train, y_train)


model.score(X_test, y_test)


0.31407400675201724


def plot_residuals_and_coeff(resid_train, resid_test, coeff):
    fig, axes = plt.subplots(1, 3, figsize=(12, 3))
    axes[0].bar(np.arange(len(resid_train)), resid_train)
    axes[0].set_xlabel("sample number")
    axes[0].set_ylabel("residual")
    axes[0].set_title("training data")
    axes[1].bar(np.arange(len(resid_test)), resid_test)
    axes[1].set_xlabel("sample number")
    axes[1].set_ylabel("residual")
    axes[1].set_title("testing data")
    axes[2].bar(np.arange(len(coeff)), coeff)
    axes[2].set_xlabel("coefficient number")
    axes[2].set_ylabel("coefficient")
    fig.tight_layout()
    return fig, axes


# plot the residuals and the coefficients
fig, ax = plot_residuals_and_coeff(resid_train, resid_test, model.coef_);
Ridge Regression


L2-penalized: adds the squared sum of the weights to the least-squares cost function.

# Use Ridge regularization
model = linear_model.Ridge(alpha=5) 


model.fit(X_train, y_train) 


Ridge(alpha=5, copy_X=True, fit_intercept=True, max_iter=None, 
normalize=False, random_state=None, solver='auto', tol=0.001) 


resid_train = y_train - model.predict(X_train)
sse_train = sum(resid_train**2)
sse_train


3292.9620358692705


resid_test = y_test - model.predict(X_test)
sse_test = sum(resid_test**2)
sse_test


209557.58585055024


The SSE on the training data rises considerably, because the model no longer fits the training set exactly, while the test SSE barely changes.


# model score on the training and test data
model.score(X_train, y_train), model.score(X_test, y_test)


(0.99003021243324718, 0.32691539290134652)


fig, ax = plot_residuals_and_coeff(resid_train, resid_test, model.coef_)
LASSO Regression


L1-penalized: certain weights can become exactly zero, which makes LASSO useful as a supervised feature
selection technique.


$\min_{w,b} \sum_i \| w^T x_i + b - y_i \|^2 + \alpha \| w \|_1$
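
Why the L1 penalty produces exact zeros while the L2 penalty only shrinks weights can be seen in a one-feature problem. The sketch below is an illustration under simplifying assumptions, not the solver scikit-learn uses: there is a single weight w, the intercept b is fixed at 0, and the helper names `soft_threshold`, `lasso_1d` and `ridge_1d` are made up for this example. It is written for Python 3 with the standard library only.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t; return exactly 0.0 when |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_1d(x, y, alpha):
    """Minimize sum((w*x_i - y_i)**2) + alpha*|w| over one weight w (b = 0)."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return soft_threshold(sxy, alpha / 2.0) / sxx


def ridge_1d(x, y, alpha):
    """Minimize sum((w*x_i - y_i)**2) + alpha*w**2 over one weight w (b = 0)."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + alpha)


x = [1.0, 2.0, 3.0]
y = [0.1, 0.2, 0.3]   # weak signal: the unpenalized weight is about 0.1
for alpha in (0.0, 1.0, 4.0):
    print(alpha, lasso_1d(x, y, alpha), ridge_1d(x, y, alpha))
```

With a large enough alpha the LASSO weight is clipped to exactly 0, whereas the ridge weight only gets smaller; this is the mechanism behind using LASSO for feature selection.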
model = linear_model.Lasso(alpha=1.0)


model.fit(X_train, y_train) 


Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False)


resid_train = y_train - model.predict(X_train)
sse_train = sse(resid_train)
sse_train


309.74971389532328


resid_test = y_test - model.predict(X_test)
sse_test = sse(resid_test)
sse_test


1489.117606500263


Compared with Ridge, both SSE values drop considerably.


fig, ax = plot_residuals_and_coeff(resid_train, resid_test, model.coef_)


The plot above shows that many of the coefficients are exactly 0.


# Search for the optimal LASSO hyperparameter alpha
alphas = np.logspace(-4, 2, 100)

coeffs = np.zeros((len(alphas), X_train.shape[1]))
sse_train = np.zeros_like(alphas)
sse_test = np.zeros_like(alphas)


for n, alpha in enumerate(alphas): 
model = linear_model.Lasso(alpha=alpha) 
model.fit(X_train, y_train) 
coeffs[n, :] = model.coef_ 
resid = y_train - model.predict(X_train) 
sse_train[n] = sum(resid**2) 
resid = y_test - model.predict(X_test) 
sse_test[n] = sum(resid**2) 


/Users/alan/anaconda/lib/python2.7/site-packages/sklearn/linear_model/coordinate_descent.py:466: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations
  ConvergenceWarning)


fig, axes = plt.subplots(1, 2, figsize=(12, 4), sharex=True)


for n in range(coeffs.shape[1]):
    axes[0].plot(np.log10(alphas), coeffs[:, n], color='k', lw=0.5)


axes[1].semilogy(np.log10(alphas), sse_train, label="train")
axes[1].semilogy(np.log10(alphas), sse_test, label="test")
axes[1].legend(loc=0)


axes[0].set_xlabel(r"${\log_{10}}\alpha$", fontsize=18)
axes[0].set_ylabel(r"coefficients", fontsize=18)
axes[1].set_xlabel(r"${\log_{10}}\alpha$", fontsize=18)
axes[1].set_ylabel(r"sse", fontsize=18)
fig.tight_layout()


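[Figure: left panel, coefficients versus log10(alpha); right panel, train and test SSE versus log10(alpha)]


As alpha increases, all coefficients eventually shrink to 0. The training SSE keeps growing with alpha, whereas the test SSE first decreases and then increases.


Near log10(alpha) = -1 the test SSE is smallest, and roughly 8 coefficients are nonzero.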


# Use LassoCV: Lasso linear model with iterative fitting along a
# regularization path
model = linear_model.LassoCV()


model.fit(X_all, y_all)


LassoCV(alphas=None, copy_X=True, cv=None, eps=0.001, fit_intercept=True,
    max_iter=1000, n_alphas=100, n_jobs=1, normalize=False, positive=False,
    precompute='auto', random_state=None, selection='cyclic', tol=0.0001,
    verbose=False)


model.alpha_
0.06559238747534718


resid_train = y_train - model.predict(X_train)
sse_train = sse(resid_train)
sse_train


1.5450589323148352


resid_test = y_test - model.predict(X_test)
sse_test = sse(resid_test)
sse_test


1.5321417406216176


Both SSE values are now fairly close to 0. Note that this model was fitted on X_all, so the test split is not truly held out here.


model.score(X_train, y_train), model.score(X_test, y_test)


(0.99999532217220677, 0.99999507886570982)


fig, ax = plot_residuals_and_coeff(resid_train, resid_test, model.coef_)
# 9 non-zero coefficients
Sections


• Loading the Breast Cancer Wisconsin dataset
• Streamlining workflows with pipelines
• Model evaluation
    o The holdout method
    o K-fold cross validation
    o Stratified k-fold cross validation
• Learning and validation curves
    o Diagnosing bias and variance problems with learning curves
    o Addressing overfitting and underfitting with validation curves
• Grid search
    o Tuning hyperparameters via grid search
    o Randomized search
    o Model selection with nested cross-validation


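Loading the Breast Cancer Wisconsin dataset


• The Breast Cancer Wisconsin dataset contains 569 samples of malignant and benign tumor cells.
• The first two columns hold the sample ID and the diagnosis (M for malignant, B for benign).
• The remaining 30 columns are features computed from images of the cell nuclei.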
import pandas as pd


df = pd.read_csv('data/wdbc.data', header=None)
df.head()


          0  1      2      3       4       5        6       7  ...
0    842302  M  17.99  10.38  122.80  1001.0  0.11840  0.2776  ...
1    842517  M  20.57  17.77  132.90  1326.0  0.08474  0.0786  ...
2  84300903  M  19.69  21.25  130.00  1203.0  0.10960  0.1599  ...
3  84348301  M  11.42  20.38   77.58   386.1  0.14250  0.2839  ...
4  84358402  M  20.29  14.34  135.10  1297.0  0.10030  0.1328  ...

5 rows x 32 columns


df.shape


(569, 32)


from sklearn.preprocessing import LabelEncoder
X = df.loc[:, 2:].values      # the 30 feature columns
y = df.loc[:, 1].values       # the diagnosis column
le = LabelEncoder()
y = le.fit_transform(y)       # encode 'M' / 'B' as integers
le.transform(['M', 'B'])


array([1, 0])


from sklearn.cross_validation import train_test_split


X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=0.20, random_state=1)


Streamlining workflows with pipelines 


A pipeline fits a model that includes an arbitrary number of transformation steps and applies it to
make predictions about new data.

pipe = make_pipeline(T1(), T2(), Classifier())


[Figure: pipe.fit(X, y) runs T1.fit and T1.transform to obtain X1, then T2.fit and T2.transform to obtain X2, then Classifier.fit(X2, y); pipe.predict(X) runs T1.transform, T2.transform and Classifier.predict in sequence]


from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


pipe_lr = Pipeline([('scl', StandardScaler()),
                    ('pca', PCA(n_components=2)),
                    ('clf', LogisticRegression(random_state=1))])


pipe_lr.fit(X_train, y_train)
print('Test Accuracy: %.3f' % pipe_lr.score(X_test, y_test))
y_pred = pipe_lr.predict(X_test)


Test Accuracy: 0.947 


Model evaluation 
The holdout method


Split the data into three parts:


• the training set is used to train the model
• the validation set is used for model selection and hyperparameter tuning
• the test set is used to estimate the generalization ability of the final model
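
A three-way holdout split can be sketched in a few lines. This is a minimal illustration, assuming Python 3 and the standard library only; the function name `holdout_split` and the 60/20/20 proportions are choices made for this example, not something the notes prescribe.

```python
import random


def holdout_split(samples, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle the samples and cut them into train / validation / test lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)   # fixed seed for reproducibility
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


train, val, test = holdout_split(range(100))
print(len(train), len(val), len(test))
```

The three lists are disjoint and together cover every sample, so the test set never influences training or tuning.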
[Figure: the holdout method, from the original set through the machine learning algorithm to the final performance estimate]

K-fold cross validation 


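Repeat the holdout method k times: set the test set aside, randomly split the remaining data into k groups (folds), hold one fold out as the
validation set and train on the rest, then rotate the validation fold until the model has been trained k times.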
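
The fold rotation can be sketched without any library. The generator below is an illustrative assumption (the name `kfold_indices` is made up here) written for Python 3; it splits the indices into contiguous folds without shuffling, which is simpler than a random split. The sample count 455 is chosen to match the training-set size implied by the fold outputs in this section.

```python
def kfold_indices(n_samples, n_folds):
    """Yield (train_indices, validation_indices) pairs for k-fold CV."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size


for train, val in kfold_indices(n_samples=455, n_folds=10):
    print(len(train), len(val))
```

Every sample lands in the validation fold exactly once, so the k validation scores together use all of the available data.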
[Figure: k-fold cross-validation, showing the training folds and the test fold in each iteration]
import numpy as np
from sklearn.cross_validation import KFold

kfold = KFold(n=len(X_train), n_folds=10, random_state=1)
scores = []
for k, (train, test) in enumerate(kfold):
    pipe_lr.fit(X_train[train], y_train[train])
    score = pipe_lr.score(X_train[test], y_train[test])
    scores.append(score)
    print('Fold: %s, Class dist.: %s, Acc: %.3f' % (k+1, np.bincount(y_train[train]), score))
print('\nCV accuracy: %.3f +/- %.3f' % (np.mean(scores), np.std(scores)))
Fold: 1, Class dist.: [256 153], Acc: 0.891 
Fold: 2, Class dist.: [254 155], Acc: 0.957 
Fold: 3, Class dist.: [258 151], Acc: 0.978 
Fold: 4, Class dist.: [257 152], Acc: 0.913 
Fold: 5, Class dist.: [255 154], Acc: 0.935 
Fold: 6, Class dist.: [258 152], Acc: 0.978 
Fold: 7, Class dist.: [257 153], Acc: 0.933 
Fold: 8, Class dist.: [254 156], Acc: 0.956 
Fold: 9, Class dist.: [259 151], Acc: 0.978 
Fold: 10, Class dist.: [257 153], Acc: 0.956 


CV accuracy: 0.947 +/- 0.028 


Stratified k-fold cross validation 


Stratified k-fold CV preserves the class proportions in each fold as far as possible when splitting the data,
which gives a more reliable estimate of model performance.
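
One simple way to keep class proportions per fold is to deal the samples of each class out round-robin. The sketch below is a simplified illustration for Python 3 (the name `stratified_fold_ids` is hypothetical), not the algorithm scikit-learn implements.

```python
from collections import defaultdict


def stratified_fold_ids(labels, n_folds):
    """Return a fold id per sample, dealing each class out round-robin."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    fold_of = [None] * len(labels)
    for label in sorted(by_class):
        for position, index in enumerate(by_class[label]):
            fold_of[index] = position % n_folds
    return fold_of


labels = [0] * 12 + [1] * 6          # toy labels with a 2:1 class ratio
fold_of = stratified_fold_ids(labels, n_folds=3)
for fold in range(3):
    members = [labels[i] for i in range(len(labels)) if fold_of[i] == fold]
    print(fold, members.count(0), members.count(1))
```

Each fold of the toy data keeps the overall 2:1 ratio, which is the property that makes per-fold accuracies comparable.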
from sklearn.cross_validation import StratifiedKFold


kfold = StratifiedKFold(y=y_train,
                        n_folds=10,
                        random_state=1)


scores = []
for k, (train, test) in enumerate(kfold):
    pipe_lr.fit(X_train[train], y_train[train])
    score = pipe_lr.score(X_train[test], y_train[test])
    scores.append(score)
    print('Fold: %s, Class dist.: %s, Acc: %.3f' % (k+1, np.bincount(y_train[train]), score))
print('\nCV accuracy: %.3f +/- %.3f' % (np.mean(scores), np.std(scores)))
Fold: 1, Class dist.: [256 153], Acc: 0.891 
Fold: 2, Class dist.: [256 153], Acc: 0.978 
Fold: 3, Class dist.: [256 153], Acc: 0.978 
Fold: 4, Class dist.: [256 153], Acc: 0.913 
Fold: 5, Class dist.: [256 153], Acc: 0.935 
Fold: 6, Class dist.: [257 153], Acc: 0.978 
Fold: 7, Class dist.: [257 153], Acc: 0.933 
Fold: 8, Class dist.: [257 153], Acc: 0.956 
Fold: 9, Class dist.: [257 153], Acc: 0.978 
Fold: 10, Class dist.: [257 153], Acc: 0.956 


CV accuracy: 0.950 +/- 0.029 


For classifiers, the cross_val_score function in sklearn uses stratified k-fold CV by default.


from sklearn.cross_validation import cross_val_score 


scores = cross_val_score(estimator=pipe_lr, 
X=X_train, 


y=y_train, 
cv=10, 
n_jobs=-1) 


print('CV accuracy scores:\n %s' % scores) 
print('CV accuracy:\n %.3f +/- %.3f' % (np.mean(scores), np.std( 
scores))) 


CV accuracy scores:
 [ 0.89130435  0.97826087  0.97826087  0.91304348  0.93478261  0.97777778
  0.93333333  0.95555556  0.97777778  0.95555556]
CV accuracy:
 0.950 +/- 0.029


Learning and validation curves 


These curves help diagnose whether a learning algorithm has a problem with overfitting (high variance) or
underfitting (high bias).
Diagnosing bias and variance problems with learning curves


A learning curve plots the training and test accuracies as functions of the sample size.
%matplotlib inline
import matplotlib.pyplot as plt
from sklearn.learning_curve import learning_curve


pipe_lr = Pipeline([('scl', StandardScaler()),
                    ('clf', LogisticRegression(penalty='l2', C=0.1, random_state=0))])


train_sizes, train_scores, test_scores = \
    learning_curve(estimator=pipe_lr, X=X_train, y=y_train,
                   train_sizes=np.linspace(0.1, 1.0, 10),
                   cv=10, n_jobs=-1)


train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)
test_mean = np.mean(test_scores, axis=1)
test_std = np.std(test_scores, axis=1)


plt.plot(train_sizes, train_mean,
         color='blue', marker='o',
         markersize=5, label='training accuracy')


plt.fill_between(train_sizes,
                 train_mean + train_std,
                 train_mean - train_std,
                 alpha=0.15, color='blue')


plt.plot(train_sizes, test_mean,
         color='green', linestyle='--',
         marker='s', markersize=5,
         label='validation accuracy')


plt.fill_between(train_sizes,
                 test_mean + test_std,
                 test_mean - test_std,
                 alpha=0.15, color='green')


plt.grid()
plt.xlabel('Number of training samples')
plt.ylabel('Accuracy')
plt.legend(loc='lower right')
plt.ylim([0.9, 1.0])
plt.tight_layout()
# plt.savefig('./figures/learning_curve.png', dpi=300)
Addressing overfitting and underfitting with validation curves


A validation curve plots the training and test accuracies as functions of a model parameter.
# validation curve, a useful tool for improving the performance of a model


from sklearn.learning_curve import validation_curve


param_range = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]


train_scores, test_scores = \
    validation_curve(estimator=pipe_lr,
                     X=X_train, y=y_train,
                     param_name='clf__C',
                     param_range=param_range, cv=10)


train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)
test_mean = np.mean(test_scores, axis=1)
test_std = np.std(test_scores, axis=1)


plt.plot(param_range, train_mean,
         color='blue', marker='o',
         markersize=5, label='training accuracy')


plt.fill_between(param_range, train_mean + train_std,
                 train_mean - train_std, alpha=0.15,
                 color='blue')


plt.plot(param_range, test_mean,
         color='green', linestyle='--',
         marker='s', markersize=5,
         label='validation accuracy')


plt.fill_between(param_range,
                 test_mean + test_std,
                 test_mean - test_std,
                 alpha=0.15, color='green')


plt.grid()
plt.xscale('log')
plt.legend(loc='lower right')
plt.xlabel('Parameter C')
plt.ylabel('Accuracy')
plt.ylim([0.9, 1.0])
plt.tight_layout()


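[Figure: validation curve of training accuracy and validation accuracy against Parameter C]


The plot shows that as C increases (that is, as regularization decreases), the model moves from underfitting through the optimum to
overfitting.
Here the parameter C should be set to 0.1.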


Grid search 
Tuning hyperparameters via grid search


Grid search looks for the optimal combination of hyperparameter values.

# brute-force exhaustive search
from sklearn.grid_search import GridSearchCV
from sklearn.svm import SVC


svc = SVC(random_state=1)
param_range = [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]


param_grid = {'C': param_range}


gs = GridSearchCV(estimator=svc,
                  param_grid=param_grid,
                  scoring='accuracy',
                  cv=10,
                  n_jobs=-1)  # use all CPU
gs = gs.fit(X_train, y_train)


print(gs.best_score_)  # validation accuracy best
print(gs.best_params_)


0.626373626374
{'C': 0.0001}

Combining pipeline and grid search


pipe_svc = Pipeline([('scl', StandardScaler()),
                     ('clf', SVC(random_state=1))])


# linear SVM: inverse regularization parameter C
# RBF kernel SVM: both C and gamma parameter
param_range = [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
param_grid = [{'clf__C': param_range,
               'clf__kernel': ['linear']},
              {'clf__C': param_range, 'clf__gamma': param_range,
               'clf__kernel': ['rbf']}]


gs = GridSearchCV(estimator=pipe_svc,
                  param_grid=param_grid,
                  scoring='accuracy',
                  cv=10,
                  n_jobs=-1)  # use all CPU


gs = gs.fit(X_train, y_train)
print(gs.best_score_)  # validation accuracy best
print(gs.best_params_)


0.978021978022
{'clf__C': 0.1, 'clf__kernel': 'linear'}


# Check performance on the test set
clf = gs.best_estimator_
clf.fit(X_train, y_train)
print('Test accuracy: %.3f' % clf.score(X_test, y_test))


Test accuracy: 0.965


Randomized search 


Although grid search is a powerful approach for finding the optimal set of
parameters, evaluating all possible parameter combinations is
computationally very expensive.

An alternative approach in scikit-learn is randomized search, which samples a fixed number of
parameter combinations from specified distributions.
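
The difference in cost can be made concrete by counting candidates. The sketch below uses only the Python 3 standard library and does not fit any model; the grid mirrors the SVM parameter ranges used in this section, and the budget of 20 draws and the exponential scales (100 for C, 0.1 for gamma) are taken from the randomized-search settings here. `random.expovariate` stands in for `scipy.stats.expon` as an assumption of this example.

```python
import itertools
import random

param_range = [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]

# Grid search: every combination is evaluated.
linear_grid = [{'C': C, 'kernel': 'linear'} for C in param_range]
rbf_grid = [{'C': C, 'gamma': gamma, 'kernel': 'rbf'}
            for C, gamma in itertools.product(param_range, param_range)]
grid_candidates = linear_grid + rbf_grid

# Randomized search: a fixed budget of draws from continuous distributions.
rng = random.Random(0)
n_iter = 20
random_candidates = [{'C': rng.expovariate(1.0 / 100),      # mean 100
                      'gamma': rng.expovariate(1.0 / 0.1),  # mean 0.1
                      'kernel': 'rbf'}
                     for _ in range(n_iter)]

print(len(grid_candidates), len(random_candidates))
```

With 10-fold cross-validation each candidate costs 10 model fits, so the grid needs 720 fits while the randomized search needs 200, and its cost stays fixed no matter how many parameters are searched.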
from scipy.stats import expon
from sklearn.grid_search import RandomizedSearchCV


pipe_svc = Pipeline([('scl', StandardScaler()),
                     ('clf', SVC(random_state=1))])


param_dist = {'clf__C': expon(scale=100),
              'clf__gamma': expon(scale=0.1),
              'clf__kernel': ['rbf']}


np.random.seed(0)
rs = RandomizedSearchCV(estimator=pipe_svc,
                        param_distributions=param_dist,
                        n_iter=20, scoring='accuracy',
                        cv=10, n_jobs=-1)


rs = rs.fit(X_train, y_train)
print(rs.best_score_)
print(rs.best_params_)


0.975824175824
{'clf__gamma': 0.009116102911900048, 'clf__C': 7.368535491284788, 'clf__kernel': 'rbf'}


clf = rs.best_estimator_
clf.fit(X_train, y_train)
print('Test accuracy: %.3f' % clf.score(X_test, y_test))


Test accuracy: 0.974


Model selection with nested cross-validation 
[Figure: nested cross-validation, showing the training folds, the test fold and the inner loop]


Use nested cross-validation to compare the SVM and decision tree models.


gs = GridSearchCV(estimator=pipe_svc,
                  param_grid=param_grid,
                  scoring='accuracy',
                  cv=2)


scores = cross_val_score(gs, X_train, y_train, scoring='accuracy', cv=5)
print('CV accuracy: %.3f +/- %.3f' % (np.mean(scores), np.std(scores)))


CV accuracy: 0.965 +/- 0.025


from sklearn.tree import DecisionTreeClassifier
gs = GridSearchCV(estimator=DecisionTreeClassifier(random_state=0),
                  param_grid=[{'max_depth': [1, 2, 3, 4, 5, 6, 7, None]}],
                  scoring='accuracy',
                  cv=2)
scores = cross_val_score(gs, X_train, y_train, scoring='accuracy', cv=5)
print('CV accuracy: %.3f +/- %.3f' % (np.mean(scores), np.std(scores)))


CV accuracy: 0.921 +/- 0.029
Based on these results, the SVM model should be chosen.
Exercise 1: write your own function that splits the data into two parts.
A63: Unsupervised learning methods


Sections 


• Some notable clustering routines
• Grouping objects by similarity using k-means
    o k-means in Sklearn
    o k-means++
    o Implementing k-means in Python
    o Using the elbow method to find the optimal number of clusters
    o Quantifying the quality of clustering via silhouette plots
• Organizing clusters as a hierarchical tree
    o Performing hierarchical clustering on a distance matrix
    o Attaching dendrograms to a heat map
    o Applying agglomerative clustering via scikit-learn
    o Applying agglomerative clustering with Iris dataset
• Locating regions of high density via DBSCAN
• Learning from labeled and unlabeled data with label propagation


Clustering is the task of gathering samples into groups of similar samples 
according to some predefined similarity or distance (dissimilarity) measure, such 
as the Euclidean distance. 


Here are some common applications of clustering algorithms: 


• Compression for data reduction
• Summarizing data as a preprocessing step for recommender systems
• Similarly:
    o grouping related web news (e.g. Google News) and web search results
    o grouping related stock quotes for investment portfolio management
    o building customer profiles for market analysis
• Building a code book of prototype samples for unsupervised feature extraction


Some notable clustering routines 
The following are some well-known clustering algorithms.


sklearn.cluster.KMeans :
    The simplest, yet effective clustering algorithm. Needs to be provided with the
    number of clusters in advance, and assumes that the data is normalized as
    input (but use a PCA model as preprocessor).
sklearn.cluster.MeanShift :
    Can find better looking clusters than KMeans but is not scalable to a high
    number of samples.
sklearn.cluster.DBSCAN :
    Can detect irregularly shaped clusters based on density, i.e. sparse regions in
    the input space are likely to become inter-cluster boundaries. Can also detect
    outliers (samples that are not part of a cluster).
sklearn.cluster.AffinityPropagation :
    Clustering algorithm based on message passing between data points.
sklearn.cluster.SpectralClustering :
    KMeans applied to a projection of the normalized graph Laplacian: finds
    normalized graph cuts if the affinity matrix is interpreted as an adjacency
    matrix of a graph.
sklearn.cluster.Ward :
    Ward implements hierarchical clustering based on the Ward algorithm, a
    variance-minimizing approach. At each step, it minimizes the sum of squared
    differences within all clusters (inertia criterion).


Of these, Ward, SpectralClustering, DBSCAN and AffinityPropagation can also
work with precomputed similarity matrices.


[Figure: example clusterings produced by these algorithms, including SpectralClustering]





Grouping objects by similarity using k-means

# make dataset
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=150,
                  n_features=2,
                  centers=3,        # number of cluster centers
                  cluster_std=0.5,
                  shuffle=True,
                  random_state=0)


import matplotlib.pyplot as plt
%matplotlib inline

plt.scatter(X[:,0], X[:,1], c='white', marker='o', s=50)
plt.grid()
plt.tight_layout()
The k-means algorithm


1. Randomly pick k centroids from the sample points as initial cluster centers.
2. Assign each sample to the nearest centroid $\mu^{(j)}, j \in \{1, \dots, k\}$.
3. Move the centroids to the center of the samples that were assigned to them.
4. Repeat steps 2 and 3 until the cluster assignments do not change or a
   user-defined tolerance or a maximum number of iterations is reached.


Visualizing K-Means Clustering


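How do we measure the similarity between two objects,
or equivalently the distance between them?


The most common distance measure is the squared Euclidean distance:
$d(x, y)^2 = \sum_{j=1}^{m} (x_j - y_j)^2 = \| x - y \|_2^2$


k-means can then be cast as an optimization problem: minimize the within-cluster sum of squared errors
(SSE), also called cluster inertia:

$SSE = \sum_{i=1}^{n} \sum_{j=1}^{k} w^{(i,j)} \| x^{(i)} - \mu^{(j)} \|_2^2$

Here $\mu^{(j)}$ is the center of cluster j, and $w^{(i,j)} = 1$ if sample $x^{(i)}$ belongs to cluster j and $w^{(i,j)} = 0$ otherwise.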
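
The SSE objective is easy to evaluate by hand for a tiny example. The following sketch, written for Python 3 with the standard library only, sums the squared Euclidean distance from each point to its assigned centroid; the function names and the four toy points are assumptions made for illustration.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def within_cluster_sse(points, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(squared_distance(p, centroids[j]) for p, j in zip(points, labels))


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
labels = [0, 0, 1, 1]                      # cluster index of each point
centroids = [(0.0, 1.0), (10.0, 1.0)]      # mean of each cluster
print(within_cluster_sse(points, labels, centroids))
```

Passing the label as an index into `centroids` plays the role of the indicator weight in the formula: only the term for a point's own cluster contributes.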


k-means in Sklearn 
# Use the KMeans model from sklearn
from sklearn.cluster import KMeans
km = KMeans(n_clusters=3,   # k must be specified in advance; a poorly chosen k can give wrong results
            init='random',
            n_init=10,      # run the algorithm 10 times and keep the best clustering, to limit the effect of a bad initialization
            max_iter=300,
            tol=1e-04,
            random_state=0)


y_km = km.fit_predict(X) 





plt.scatter(X[y_km==0,0], 
X[y_km==0,1], 
s=50, 
c='lightgreen', 
marker='s', 
label='cluster 1') 
plt.scatter(X[y_km==1,0], 
X[y_km==1,1], 
s=50, 
c='orange', 
marker='o', 
label='cluster 2') 
plt.scatter(X[y_km==2,0], 
X[y_km==2,1], 
s=50, 
c='lightblue', 
marker='v', 
label='cluster 3') 
plt.scatter(km.cluster_centers_[:,0],
            km.cluster_centers_[:,1],
            s=250,
            marker='*',
            c='red',
            label='centroids')
plt.legend()
plt.grid()
plt.tight_layout()
#plt.savefig('./figures/centroids.png', dpi=300)
k-means++
The k-means++ algorithm improves k-means by choosing better initial values for the centroids.


1. Initialize an empty set M to store the k centroids being selected.
2. Randomly choose the first centroid $\mu^{(j)}$ from the input samples and assign it
   to M.
3. For each sample $x^{(i)}$ that is not in M, find the minimum squared distance
   $d(x^{(i)}, M)^2$ to any of the centroids in M.
4. To randomly select the next centroid $\mu^{(p)}$, use a weighted probability
   distribution equal to $\frac{d(\mu^{(p)}, M)^2}{\sum_i d(x^{(i)}, M)^2}$.
5. Repeat steps 3 and 4 until k centroids are chosen.
6. Proceed with the classic k-means algorithm.
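
The weighted distribution used to pick the next centroid can be computed directly. The sketch below is a minimal Python 3 illustration with made-up helper names and three toy points on a line; it only computes the selection probabilities and does not run the full initialization.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def next_centroid_probabilities(points, chosen):
    """Probability of each point becoming the next k-means++ centroid."""
    # Squared distance from every point to its nearest already-chosen centroid.
    d2 = [min(squared_distance(p, c) for c in chosen) for p in points]
    total = sum(d2)
    return [d / total for d in d2]


points = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]
chosen = [points[0]]          # pretend the first centroid was already drawn
print(next_centroid_probabilities(points, chosen))
```

A point that is already a centroid gets probability 0, and the farthest point is by far the most likely pick, which is how k-means++ spreads the initial centroids out.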


In sklearn, using the k-means++ algorithm only requires setting init="k-means++" in KMeans().


Comparing k-means initialization strategies
import numpy as np
from sklearn.utils import shuffle
from sklearn.utils import check_random_state
from sklearn.cluster import KMeans


random_state = np.random.RandomState(0)


# Number of runs (with randomly generated dataset) for each strategy so as
# to be able to compute an estimate of the standard deviation
n_runs = 5


# k-means models can do several random inits so as to be able to trade
# CPU time for convergence robustness
n_init_range = np.array([1, 5, 10, 15, 20])


# Datasets generation parameters
n_samples_per_center = 100
grid_size = 3
scale = 0.1
n_clusters = grid_size ** 2


def make_data(random_state, n_samples_per_center, grid_size, scale):
    random_state = check_random_state(random_state)
    centers = np.array([[i, j]
                        for i in range(grid_size)
                        for j in range(grid_size)])
    n_clusters_true, n_features = centers.shape

    noise = random_state.normal(
        scale=scale, size=(n_samples_per_center, centers.shape[1]))

    X = np.concatenate([c + noise for c in centers])
    y = np.concatenate([[i] * n_samples_per_center
                        for i in range(n_clusters_true)])
    return shuffle(X, y, random_state=random_state)


fig = plt.figure()
plots = []
legends = []


cases = [
    (KMeans, 'k-means++'),
    (KMeans, 'random')
]


for factory, init in cases:
    print("Evaluation of %s with %s init" % (factory.__name__, init))
    inertia = np.empty((len(n_init_range), n_runs))

    for run_id in range(n_runs):
        X_data, y = make_data(run_id, n_samples_per_center, grid_size, scale)
        for i, n_init in enumerate(n_init_range):
            kmean = factory(n_clusters=n_clusters, init=init, random_state=run_id,
                            n_init=n_init).fit(X_data)
            inertia[i, run_id] = kmean.inertia_
    p = plt.errorbar(n_init_range, inertia.mean(axis=1), inertia.std(axis=1))
    plots.append(p[0])
    legends.append("%s with %s init" % (factory.__name__, init))


plt.xlabel('n_init')
plt.ylabel('inertia')
plt.legend(plots, legends)
plt.title("Mean inertia for various k-means init across %d runs" % n_runs);


Evaluation of KMeans with k-means++ init 
Evaluation of KMeans with random init 


[Figure: "Mean inertia for various k-means init across 5 runs", plotting inertia against n_init for KMeans with k-means++ init and KMeans with random init]


Implementing k-means in Python 
import numpy as np
from sklearn.metrics import pairwise_distances


def get_initial_centroids(data, k, seed=None):
    '''Randomly choose k data points as initial centroids'''
    if seed is not None: # useful for obtaining consistent results
        np.random.seed(seed)
    n = data.shape[0] # number of data points

    # Pick K indices from range [0, N).
    rand_indices = np.random.randint(0, n, k)

    # Keep centroids as dense format, as many entries will be nonzero due to averaging.
    # As long as at least one document in a cluster contains a word,
    # it will carry a nonzero weight in the TF-IDF vector of the centroid.
    centroids = data[rand_indices,:]

    return centroids


def smart_initialize(data, k, seed=None):
    '''Use k-means++ to initialize a good set of centroids'''
    if seed is not None: # useful for obtaining consistent results
        np.random.seed(seed)
    centroids = np.zeros((k, data.shape[1]))
    # Randomly choose the first centroid.
    # Since we have no prior knowledge, choose uniformly at random
    idx = np.random.randint(data.shape[0])
    centroids[0] = data[idx,:]
    # Compute squared distances from the first centroid chosen to all the other data points
    squared_distances = pairwise_distances(data, centroids[0:1], metric='euclidean').flatten()**2

    for i in range(1, k):
        # Choose the next centroid randomly, so that the probability for each data point to be chosen
        # is directly proportional to its squared distance from the nearest centroid.
        # Roughly speaking, a new centroid should be as far from the other centroids as possible.
        idx = np.random.choice(data.shape[0], 1, p=squared_distances/sum(squared_distances))
        centroids[i] = data[idx,:]
        # Now compute squared distances from the nearest chosen centroid to all data points
        squared_distances = np.min(pairwise_distances(data, centroids[0:i+1], metric='euclidean')**2, axis=1)

    return centroids


def assign_clusters(data, centroids):

    # Compute distances between each data point and the set of centroids:
    distances_from_centroids = pairwise_distances(data, centroids)

    # Compute cluster assignments for each data point:
    cluster_assignment = np.argmin(distances_from_centroids, axis=1)

    return cluster_assignment


def revise_centroids(data, k, cluster_assignment):
    new_centroids = []
    for i in range(k):
        # Select all data points that belong to cluster i.
        member_data_points = data[cluster_assignment == i]
        # Compute the mean of the data points.
        centroid = member_data_points.mean(axis=0)
        new_centroids.append(centroid)
    new_centroids = np.array(new_centroids)
    return new_centroids


def kmeans(data, k, init='kmeans++', maxiter=100, seed=None):
    # Initialize centroids
    if init == 'kmeans++':
        centroids = smart_initialize(data, k, seed)
    else:
        centroids = get_initial_centroids(data, k, seed)
    prev_cluster_assignment = None

    for itr in range(maxiter):
        # 1. Make cluster assignments using nearest centroids
        cluster_assignment = assign_clusters(data, centroids)

        # 2. Compute a new centroid for each of the k clusters,
        #    averaging all data points assigned to that cluster.
        centroids = revise_centroids(data, k, cluster_assignment)

        # Stop as soon as no assignment has changed since the last iteration
        if prev_cluster_assignment is not None and \
           (prev_cluster_assignment == cluster_assignment).all():
            break

        prev_cluster_assignment = cluster_assignment[:]

    return centroids, cluster_assignment


centers, y_km = kmeans(X, 3, seed=0)


plt.scatter(X[y_km==0,0], 
X[y_km==0,1], 
s=50, 
c='lightgreen', 
marker='s', 
label='cluster 1') 

plt.scatter(X[y_km==1,0], 
X[y_km==1,1], 
s=50, 
c='orange', 
marker='o', 
label='cluster 2') 

plt.scatter(X[y_km==2,0], 
X[y_km==2,1], 
s=50, 
c='lightblue', 
marker='v', 
label='cluster 3') 

plt.scatter(centers[:,0],
            centers[:,1],
            s=250,
            marker='*',
            c='red',
            label='centroids')
plt.legend()
plt.grid()
plt.tight_layout()
#plt.savefig('./figures/centroids.png', dpi=300)
Using the elbow method to find the optimal number of clusters


We usually do not know in advance how many clusters the data contains, so we need a way to choose a suitable k.


Clustering quality can be judged by the within-cluster SSE (distortion), which is available through the
inertia_ attribute of KMeans():

print('Distortion: %.2f' % km.inertia_)
Distortion: 72.48


The Elbow method is a "rule-of-thumb" approach to finding the optimal number of 
clusters. 


distortions = []
for i in range(1, 11):
    km = KMeans(n_clusters=i,
                init='k-means++',
                n_init=10,
                max_iter=300,
                random_state=0)
    km.fit(X)
    distortions.append(km.inertia_)
plt.plot(range(1,11), distortions, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Distortion')
plt.tight_layout()


[Figure: distortion versus number of clusters]


The plot shows that the elbow is at 3, so k=3 is the best choice.


Quantifying the quality of clustering via silhouette plots

Another way to evaluate clustering quality is silhouette analysis, which measures how tightly grouped
the samples in a cluster are.


The silhouette coefficient of a sample $x^{(i)}$ is computed as follows:


1. Calculate the cluster cohesion $a^{(i)}$ as the average distance between a
   sample $x^{(i)}$ and all other points in the same cluster.
2. Calculate the cluster separation $b^{(i)}$ from the next closest cluster as the
   average distance between the sample $x^{(i)}$ and all samples in the nearest
   cluster.
3. Calculate the silhouette $s^{(i)}$ as the difference between cluster cohesion and
   separation divided by the greater of the two, as shown here:

$s^{(i)} = \frac{b^{(i)} - a^{(i)}}{\max\{b^{(i)}, a^{(i)}\}}$
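
The three steps can be followed literally for a single sample. The sketch below is a hedged illustration for Python 3.8+ (it relies on `math.dist`); the function name `silhouette` and the toy points are assumptions, and it presumes every cluster has at least two members and that at least two clusters exist.

```python
import math


def silhouette(index, points, labels):
    """Silhouette coefficient s = (b - a) / max(a, b) for one sample."""
    own = labels[index]
    dist_by_cluster = {}
    for j, (p, label) in enumerate(zip(points, labels)):
        if j == index:
            continue                      # skip the sample itself
        dist_by_cluster.setdefault(label, []).append(math.dist(points[index], p))
    # a: mean distance to the other members of the sample's own cluster
    a = sum(dist_by_cluster[own]) / len(dist_by_cluster[own])
    # b: smallest mean distance to the members of any other cluster
    b = min(sum(d) / len(d)
            for label, d in dist_by_cluster.items() if label != own)
    return (b - a) / max(a, b)


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
good = silhouette(0, points, [0, 0, 1, 1])   # sample grouped with its neighbour
bad = silhouette(0, points, [0, 1, 0, 1])    # sample grouped with a far point
print(good, bad)
```

A value near 1 means the sample sits well inside its cluster, while a negative value means it is closer on average to another cluster than to its own.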


import numpy as np
from matplotlib import cm
from sklearn.metrics import silhouette_samples


km = KMeans(n_clusters=3,
            init='k-means++',
            n_init=10,
            max_iter=300,
            tol=1e-04,
            random_state=0)
y_km = km.fit_predict(X)


cluster_labels = np.unique(y_km)
n_clusters = cluster_labels.shape[0]
silhouette_vals = silhouette_samples(X, y_km, metric='euclidean')
y_ax_lower, y_ax_upper = 0, 0
yticks = []
for i, c in enumerate(cluster_labels):
    c_silhouette_vals = silhouette_vals[y_km == c]
    c_silhouette_vals.sort()
    y_ax_upper += len(c_silhouette_vals)
    color = cm.jet(i / float(n_clusters))
    plt.barh(range(y_ax_lower, y_ax_upper), c_silhouette_vals, height=1.0,
             edgecolor='none', color=color)

    yticks.append((y_ax_lower + y_ax_upper) / 2)
    y_ax_lower += len(c_silhouette_vals)


silhouette_avg = np.mean(silhouette_vals)
plt.axvline(silhouette_avg, color="red", linestyle="--")


plt.yticks(yticks, cluster_labels + 1)
plt.ylabel('Cluster')
plt.xlabel('Silhouette coefficient')


plt.tight_layout()
# plt.savefig('./figures/silhouette.png', dpi=300)
[Figure: silhouette plot; x-axis: Silhouette coefficient]
The red line marks the mean silhouette coefficient over all samples; it can serve as a summary metric for the clustering model.


Comparison to "bad" clustering: 


km = KMeans(n_clusters=2,  # deliberately set to 2
            init='k-means++',
            n_init=10,
            max_iter=300,
            tol=1e-04,
            random_state=0)

y_km = km.fit_predict(X)


plt.scatter(X[y_km==0,0],
            X[y_km==0,1],
            s=50,
            c='lightgreen',
            marker='s',
            label='cluster 1')

plt.scatter(X[y_km==1,0],
            X[y_km==1,1],
            s=50,
            c='orange',
            marker='o',
            label='cluster 2')


plt.scatter(km.cluster_centers_[:,0], km.cluster_centers_[:,1],
            s=250, marker='*', c='red', label='centroids')

plt.legend()

plt.grid()

plt.tight_layout()

# plt.savefig('./figures/centroids_bad.png', dpi=300)
cluster_labels = np.unique(y_km)
n_clusters = cluster_labels.shape[0]
silhouette_vals = silhouette_samples(X, y_km, metric='euclidean')
y_ax_lower, y_ax_upper = 0, 0
yticks = []
for i, c in enumerate(cluster_labels):
    c_silhouette_vals = silhouette_vals[y_km == c]
    c_silhouette_vals.sort()
    y_ax_upper += len(c_silhouette_vals)
    color = cm.jet(i / float(n_clusters))
    plt.barh(range(y_ax_lower, y_ax_upper), c_silhouette_vals, height=1.0,
             edgecolor='none', color=color)
    yticks.append((y_ax_lower + y_ax_upper) / 2)
    y_ax_lower += len(c_silhouette_vals)


silhouette_avg = np.mean(silhouette_vals)
plt.axvline(silhouette_avg, color="red", linestyle="--")


plt.yticks(yticks, cluster_labels + 1)
plt.ylabel('Cluster')
plt.xlabel('Silhouette coefficient')


plt.tight_layout()
Organizing clusters as a hierarchical tree

One nice feature of hierarchical clustering is that we can visualize the results as a
dendrogram, a hierarchical tree. Using the visualization, we can then decide how
"deep" we want to cluster the dataset by setting a "depth" threshold. Or in other
words, we don't need to make a decision about the number of clusters upfront.


Agglomerative and divisive hierarchical clustering 


Furthermore, we can distinguish between 2 main approaches to hierarchical
clustering: divisive clustering and agglomerative clustering. In agglomerative
clustering, we start with every sample as its own cluster and iteratively merge the
closest pairs of clusters -- we can see it as a bottom-up approach for
building the clustering dendrogram.

In divisive clustering, however, we start with the whole dataset as one cluster, and 
we iteratively split it into smaller subclusters -- a top-down approach. 


In this notebook, we will use agglomerative clustering. 


Single and complete linkage 


Now, the next question is how we measure the similarity between samples. One
approach is the familiar Euclidean distance metric that we already used via the
K-Means algorithm. As a refresher, the distance between 2 m-dimensional vectors p
and q can be computed as:


\begin{align} \mathrm{d}(\mathbf{q},\mathbf{p}) & = \sqrt{(q_1-p_1)^2 + (q_2-p_2)^2
+ \cdots + (q_m-p_m)^2} \\[8pt] & = \sqrt{\sum_{j=1}^m (q_j-p_j)^2}.\end{align}


However, that's the distance between 2 samples. Now, how do we compute the 
similarity between subclusters of samples in order to decide which clusters to 
merge when constructing the dendrogram? I.e., our goal is to iteratively merge the
most similar pairs of clusters until only one big cluster remains. There are many 
different approaches to this, for example single and complete linkage. 


In single linkage, we take the pair of the most similar samples (based on the 
Euclidean distance, for example) in each cluster, and merge the two clusters 
which have the most similar 2 members into one new, bigger cluster. 


In complete linkage, we compare the pairs of the two most dissimilar members of 
each cluster with each other, and we merge the 2 clusters where the distance 
between its 2 most dissimilar members is smallest. 
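The difference between the two criteria is easiest to see on a tiny example. The sketch below is illustrative only: the three clusters A, B and C and all function names are hypothetical, chosen so that single and complete linkage pick different pairs to merge.

```python
import numpy as np
from itertools import combinations

def pairwise_distances(A, B):
    """Euclidean distances between every member of A and every member of B."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    diff = A[:, None, :] - B[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def single_linkage(A, B):
    # distance between the two MOST SIMILAR members
    return pairwise_distances(A, B).min()

def complete_linkage(A, B):
    # distance between the two MOST DISSIMILAR members
    return pairwise_distances(A, B).max()

def closest_pair(clusters, linkage):
    """Names of the two clusters that would be merged next under `linkage`."""
    return min(combinations(sorted(clusters), 2),
               key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]))

# Hypothetical clusters: B has one member close to A and one far away
clusters = {
    'A': [[0, 0], [1, 0]],
    'B': [[3, 0], [6, 0]],
    'C': [[1, 4]],
}

merge_single = closest_pair(clusters, single_linkage)
merge_complete = closest_pair(clusters, complete_linkage)
print(merge_single, merge_complete)
```

Single linkage merges A with B because their nearest members are only 2 apart, while complete linkage prefers A with C because the far member of B pushes the A-B distance up to 6.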


[Figure: single linkage compares the most similar members of two clusters; complete linkage compares the most dissimilar members]


import pandas as pd 
import numpy as np 


np.random.seed(123)


variables = ['X', 'Y', 'Z']
labels = ['ID_0','ID_1','ID_2','ID_3','ID_4']


X = np.random.random_sample([5,3])*10
df = pd.DataFrame(X, columns=variables, index=labels)
df


             X         Y         Z
ID_0  6.964692  2.861393  2.268515
ID_1  5.513148  7.194690  4.231065
ID_2  9.807642  6.848297  4.809319
ID_3  3.921175  3.431780  7.290497
ID_4  4.385722  0.596779  3.980443


Performing hierarchical clustering on a 
distance matrix 
from scipy.spatial.distance import pdist, squareform


row_dist = pd.DataFrame(squareform(pdist(df, metric='euclidean')),
                        columns=labels, index=labels)
row_dist


          ID_0      ID_1      ID_2      ID_3      ID_4
ID_0  0.000000  4.973534  5.516653  5.899885  3.835396
ID_1  4.973534  0.000000  4.347073  5.104311  6.698233
ID_2  5.516653  4.347073  0.000000  7.244262  8.316594
ID_3  5.899885  5.104311  7.244262  0.000000  4.382864
ID_4  3.835396  6.698233  8.316594  4.382864  0.000000


We can either pass a condensed distance matrix (upper triangular) from the pdist
function, or we can pass the "original" data array and define the metric='euclidean'
argument in linkage. However, we should not pass the squareform distance
matrix: linkage treats a 2-dimensional input as a matrix of observations, so each row
of distances would be read as a sample. This yields different distance values although
the overall clustering could be the same.
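A short sketch of the two accepted input forms, assuming SciPy is available. The variable names are illustrative; the data is generated with the same seed as the 5x3 sample matrix above. It shows the shape of the condensed form and checks that both accepted inputs lead to the same linkage matrix.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage

rng = np.random.RandomState(123)
samples = rng.random_sample([5, 3]) * 10        # 5 samples, 3 features

# Condensed form: one entry per unordered pair, n*(n-1)/2 = 10 values
condensed = pdist(samples, metric='euclidean')
# Square form: symmetric 5x5 matrix with a zero diagonal (for display only)
square = squareform(condensed)

# Both of these are valid inputs to linkage
from_condensed = linkage(condensed, method='complete')
from_samples = linkage(samples, method='complete', metric='euclidean')

print(condensed.shape, square.shape)
print(np.allclose(from_condensed, from_samples))
```

The square matrix is convenient to read as a table, but only the condensed vector or the raw sample matrix should be handed to linkage.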


from scipy.cluster.hierarchy import linkage


# 1. incorrect approach: squareform distance matrix
row_clusters = linkage(row_dist, method='complete', metric='euclidean')
pd.DataFrame(row_clusters,
             columns=['row label 1', 'row label 2', 'distance',
                      'no. of items in clust.'],
             index=['cluster %d' % (i+1) for i in range(row_clusters.shape[0])])


           row label 1  row label 2   distance  no. of items in clust.
cluster 1          0.0          4.0   6.521973                     2.0
cluster 2          1.0          2.0   6.729603                     2.0
cluster 3          3.0          5.0   8.539247                     3.0
cluster 4          6.0          7.0  12.444824                     5.0


# 2. correct approach: condensed distance matrix


row_clusters = linkage(pdist(df, metric='euclidean'), method='complete')
pd.DataFrame(row_clusters,
             columns=['row label 1', 'row label 2', 'distance',
                      'no. of items in clust.'],
             index=['cluster %d' % (i+1) for i in range(row_clusters.shape[0])])


           row label 1  row label 2  distance  no. of items in clust.
cluster 1          0.0          4.0  3.835396                     2.0
cluster 2          1.0          2.0  4.347073                     2.0
cluster 3          3.0          5.0  5.899885                     3.0
cluster 4          6.0          7.0  8.316594                     5.0
# 3. correct approach: input sample matrix


row_clusters = linkage(df.values, method='complete', metric='euclidean')
pd.DataFrame(row_clusters,
             columns=['row label 1', 'row label 2', 'distance',
                      'no. of items in clust.'],
             index=['cluster %d' % (i+1) for i in range(row_clusters.shape[0])])


           row label 1  row label 2  distance  no. of items in clust.
cluster 1          0.0          4.0  3.835396                     2.0
cluster 2          1.0          2.0  4.347073                     2.0
cluster 3          3.0          5.0  5.899885                     3.0
cluster 4          6.0          7.0  8.316594                     5.0


# Visualize the result with a dendrogram
from scipy.cluster.hierarchy import dendrogram


# make dendrogram black (part 1/2)
# from scipy.cluster.hierarchy import set_link_color_palette
# set_link_color_palette(['black'])


row_dendr = dendrogram(row_clusters,
                       labels=labels,
                       # make dendrogram black (part 2/2)
                       # color_threshold=np.inf
                       )
plt.tight_layout()
plt.ylabel('Euclidean distance');
# plt.savefig('./figures/dendrogram.png', dpi=300, bbox_inches='tight')
[Figure: dendrogram of the five samples; y-axis: Euclidean distance]


Attaching dendrograms to a heat map
# plot row dendrogram
fig = plt.figure(figsize=(8, 8), facecolor='white')
axd = fig.add_axes([0.09, 0.1, 0.2, 0.6])


# note: for matplotlib < v1.5.1, please use orientation='right'
row_dendr = dendrogram(row_clusters, orientation='left')


# reorder data with respect to clustering
# (the leaves are integer positions, so .iloc is used; the older .ix indexer
# is no longer available in recent pandas releases)
df_rowclust = df.iloc[row_dendr['leaves'][::-1]]


axd.set_xticks([])
axd.set_yticks([])


# remove axes spines from dendrogram
for i in axd.spines.values():
    i.set_visible(False)


# plot heatmap
axm = fig.add_axes([0.23, 0.1, 0.6, 0.6])  # x-pos, y-pos, width, height
cax = axm.matshow(df_rowclust, interpolation='nearest', cmap='hot_r')
fig.colorbar(cax)
axm.set_xticklabels([''] + list(df_rowclust.columns))
axm.set_yticklabels([''] + list(df_rowclust.index));


# plt.savefig('./figures/heatmap.png', dpi=300)
Applying agglomerative clustering via scikit-learn
from sklearn.cluster import AgglomerativeClustering
# Euclidean distance is the default. The keyword that selects it is `affinity`
# in older scikit-learn releases and `metric` in newer ones, so it is omitted here.
ac = AgglomerativeClustering(n_clusters=2, linkage='complete')


labels = ac.fit_predict(X)
print('Cluster labels: %s' % labels)


Cluster labels: [0 1 1 0 0]


The result agrees with the dendrogram above.


Applying agglomerative clustering with iris 
dataset 




from sklearn import datasets 


iris = datasets.load_iris()

X = iris.data[:, [2, 3]] 

y = iris.target 

n_samples, n_features = X.shape 


plt.scatter(X[:, 0], X[:, 1], c=y); 





from scipy.cluster.hierarchy import linkage 
from scipy.cluster.hierarchy import dendrogram 


clusters = linkage(X,
                   metric='euclidean',
                   method='complete')


dendr = dendrogram(clusters) 


plt.ylabel('Euclidean Distance'); 


[Figure: dendrogram of the iris data; y-axis: Euclidean Distance]

from sklearn.cluster import AgglomerativeClustering


# Euclidean distance is the default, so the metric keyword is omitted
# (`affinity` in older scikit-learn releases, `metric` in newer ones).
ac = AgglomerativeClustering(n_clusters=3,
                             linkage='complete')


prediction = ac.fit_predict(X)
print('Cluster labels:\n %s\n' % prediction)


Cluster labels:
 [2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 0 0 0 1 0 0 0 1 0 1 1 1 1 0 1 1 0 1 0 1 0 1 0 0
 1 1 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0]


plt.scatter(X[:, 0], X[:, 1], c=prediction); 





Locating regions of high density via 
DBSCAN 
Another useful approach to clustering is Density-based Spatial Clustering of
Applications with Noise (DBSCAN). In essence, we can think of DBSCAN as an
algorithm that divides the dataset into subgroups based on dense regions of points.


In DBSCAN, we distinguish between 3 different "points": 


• Core points: A core point is a point that has at least a minimum number of
other points (MinPts) within its radius epsilon.

• Border points: A border point is a point that is not a core point, since it doesn't
have enough points (fewer than MinPts) in its neighborhood, but lies within the
radius epsilon of a core point.

• Noise points: All other points that are neither core points nor border points.
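The three definitions above can be sketched in a few lines of NumPy. This is an illustrative classification step only, not the full DBSCAN algorithm; classify_points and the six-point X_demo array are hypothetical names. It assumes that a point counts toward its own neighborhood, which is the convention scikit-learn documents for min_samples.

```python
import numpy as np

def classify_points(X, eps, min_pts):
    """Label every sample as 'core', 'border' or 'noise'."""
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances
    dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    within_eps = dist <= eps                       # neighborhood, including self
    is_core = within_eps.sum(axis=1) >= min_pts    # enough neighbors
    near_core = within_eps[:, is_core].any(axis=1) # within eps of some core point
    return np.where(is_core, 'core',
                    np.where(near_core, 'border', 'noise'))

# Hypothetical toy data on a line: a dense run, two edge points, one outlier
X_demo = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0],
                   [3.0, 0.0], [4.4, 0.0], [10.0, 0.0]])
kinds_demo = classify_points(X_demo, eps=1.5, min_pts=3)
print(kinds_demo)
```

The two points at the ends of the dense run have only one neighbor besides themselves, so they are border points; the isolated point at x = 10 is noise.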


[Figure: core, border, and noise points for MinPts = 3]


After every point has been labeled as a core, border, or noise point, the DBSCAN algorithm consists of two steps:


1. Form a separate cluster for each core point or a connected group of core
points (core points are connected if they are no farther away than epsilon).
2. Assign each border point to the cluster of its corresponding core point.


A nice feature about DBSCAN is that we don't have to specify a number of 
clusters upfront. However, it requires the setting of additional hyperparameters 
such as the value for MinPts and the radius epsilon. 


from sklearn.datasets import make_moons 


X, y = make_moons(n_samples=200, noise=0.05, random_state=0) 
plt.scatter(X[:,0], X[:,1]) 
plt.tight_layout()

K-means and hierarchical clustering:


# K-means and complete linkage clustering side by side
f, (ax1, ax2) = plt.subplots(1, 2, figsize=(8,3))


km = KMeans(n_clusters=2, random_state=0)
y_km = km.fit_predict(X)
ax1.scatter(X[y_km==0,0], X[y_km==0,1], c='lightblue', marker='o',
            s=40, label='cluster 1')
ax1.scatter(X[y_km==1,0], X[y_km==1,1], c='red', marker='s', s=40,
            label='cluster 2')
ax1.set_title('K-means clustering')


# Euclidean distance is the default, so the metric keyword is omitted
# (`affinity` in older scikit-learn releases, `metric` in newer ones).
ac = AgglomerativeClustering(n_clusters=2, linkage='complete')
y_ac = ac.fit_predict(X)
ax2.scatter(X[y_ac==0,0], X[y_ac==0,1], c='lightblue', marker='o',
            s=40, label='cluster 1')
ax2.scatter(X[y_ac==1,0], X[y_ac==1,1], c='red', marker='s', s=40,
            label='cluster 2')
ax2.set_title('Agglomerative clustering')


plt.legend()
plt.tight_layout()
# plt.savefig('./figures/kmeans_and_ac.png', dpi=300)
[Figure: left panel "K-means clustering"; right panel "Agglomerative clustering"]

Neither method works well here: the two half-moons are not cleanly separated.


Density-based clustering: 


# DBSCAN
from sklearn.cluster import DBSCAN


db = DBSCAN(eps=0.2, min_samples=5, metric='euclidean')
y_db = db.fit_predict(X)
plt.scatter(X[y_db==0,0], X[y_db==0,1], c='lightblue', marker='o',
            s=40, label='cluster 1')
plt.scatter(X[y_db==1,0], X[y_db==1,1], c='red', marker='s', s=40,
            label='cluster 2')

plt.legend()

plt.tight_layout()

# plt.savefig('./figures/moons_dbscan.png', dpi=300)
DBSCAN clusters the half-moon data well.
from sklearn.datasets import make_circles
# assign the labels to y (lowercase) so that c=y below matches the new data
X, y = make_circles(n_samples=500,
                    factor=.6,
                    noise=.05)


plt.scatter(X[:, 0], X[:, 1], c=y);

k-means


from sklearn.cluster import KMeans 

km = KMeans(n_clusters=2) 

predict = km.fit_predict(X)
plt.scatter(X[:, 0], X[:, 1], c=predict);

Agglomerative clustering


from sklearn.cluster import AgglomerativeClustering 
ac = AgglomerativeClustering() 

predict = ac.fit_predict(X) 

plt.scatter(X[:, 0], X[:, 1], c=predict); 
DBSCAN


from sklearn.cluster import DBSCAN 
db = DBSCAN(eps=0.15, 
min_samples=9, 
metric='euclidean') 
predict = db.fit_predict(X) 
plt.scatter(X[:, 0], X[:, 1], c=predict); 
Learning from labeled and unlabeled data with label propagation


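Labeling data is expensive. When only part of the training set is labeled, supervised
learning forces you to throw away the unlabeled samples. In practice, however, labeled
and unlabeled samples are related, and this relationship can be used to help learning.
That is the idea behind the semi-supervised Label Propagation algorithm.


Its basic logic is "you become like those you keep close", which is also the idea behind
KNN: if A and B are close in the X space, the y label of A can be passed on to B. Iterating
further, if C is also close to B, then C should receive the same label as B. The computation
therefore has two steps. The first step computes the distances between samples and builds a
transition matrix. The second step multiplies the transition matrix by the Y matrix. Y contains
both the labeled and the unlabeled part, and the multiplication propagates the known labels
to the unlabeled rows of Y. The details are in the original paper. scikit-learn ships this
algorithm, and its documentation includes examples.
Below is a toy demo implemented with Python's NumPy module.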


import pandas as pd
import numpy as np


from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target
n = len(y)


np.random.seed(42)
train_index = np.random.choice(n, int(0.6*n), replace=False)
test_index = np.setdiff1d(np.arange(n), train_index)


sigma = X.var(axis=0)
weights = np.zeros((n,n))


def weight_func(ind1, ind2, X=X, sigma=sigma):
    # Gaussian similarity, with each feature scaled by its variance
    return np.exp(-np.sum((X[ind1] - X[ind2])**2/sigma))


for i in range(n):
    for j in range(n):
        weights[i,j] = weight_func(i,j)


# transition matrix: every column of the (symmetric) weight matrix sums to 1
t = weights / weights.sum(axis=1)


# one-hot encoded label matrix
y_m = np.zeros((n, len(np.unique(y))))
for i in range(n):
    y_m[i,y[i]] = 1


# hide the test labels by overwriting them with random values
y_m[test_index] = np.random.random(y_m[test_index].shape)
clamp = y_m[train_index]


iter_n = 50
for i in range(iter_n):
    y_m = t.dot(y_m)                      # propagate labels
    y_m = (y_m.T / y_m.sum(axis=1)).T     # normalize each row
    y_m[train_index] = clamp              # reset the known labels


predict = y_m[test_index].argmax(axis=1)
np.sum(y[test_index] == predict) / float(len(predict))


0.91666666666666663 
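For comparison, the same experiment can be sketched with the implementation built into scikit-learn. This is a minimal sketch under stated assumptions: the 40% masking rate, the knn kernel and n_neighbors=7 are illustrative choices, not settings taken from the notebook, so the accuracy will differ from the toy demo. Unlabeled samples are marked with -1, which is the convention the estimator expects.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.semi_supervised import LabelPropagation

iris = load_iris()
X, y = iris.data, iris.target

# Hide roughly 40% of the labels; -1 marks a sample as unlabeled
rng = np.random.RandomState(42)
unlabeled = rng.rand(len(y)) < 0.4
y_partial = np.copy(y)
y_partial[unlabeled] = -1

# Assumed settings: k-nearest-neighbor graph with 7 neighbors
model = LabelPropagation(kernel='knn', n_neighbors=7)
model.fit(X, y_partial)

# transduction_ holds the label assigned to every training sample
predicted = model.transduction_[unlabeled]
accuracy = np.mean(predicted == y[unlabeled])
print('Accuracy on hidden labels: %.3f' % accuracy)
```

Only the hidden labels are scored, which mirrors how the toy demo evaluates on test_index.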


Label Propagation learning a complex structure 


Example of LabelPropagation learning a complex internal structure to 
demonstrate “manifold learning”. The outer circle should be labeled “red” and the 
inner circle “blue”. Because both label groups lie inside their own distinct shape, 
we can see that the labels propagate correctly around the circle. 


import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# LabelSpreading is imported directly; the label_propagation submodule
# is not importable in recent scikit-learn releases
from sklearn.semi_supervised import LabelSpreading
from sklearn.datasets import make_circles


n_samples = 200
X, y = make_circles(n_samples=n_samples, shuffle=False)
outer, inner = 0, 1
labels = -np.ones(n_samples)
labels[0] = outer
labels[-1] = inner


# recent scikit-learn releases require 0 < alpha < 1 (the original used alpha=1.0)
label_spread = LabelSpreading(kernel='knn', alpha=0.8)
label_spread.fit(X, labels)


output_labels = label_spread.transduction_
plt.figure(figsize=(8.5, 4))
plt.subplot(1, 2, 1)
plt.scatter(X[labels == outer, 0], X[labels == outer, 1], color='navy',
            marker='s', lw=0, label="outer labeled", s=10)
plt.scatter(X[labels == inner, 0], X[labels == inner, 1], color='c',
            marker='s', lw=0, label='inner labeled', s=10)
plt.scatter(X[labels == -1, 0], X[labels == -1, 1], color='darkorange',
            marker='.', label='unlabeled')
plt.legend(scatterpoints=1, shadow=False, loc='upper right')
plt.title("Raw data (2 classes=outer and inner)")


plt.subplot(1, 2, 2)
output_label_array = np.asarray(output_labels)
outer_numbers = np.where(output_label_array == outer)[0]
inner_numbers = np.where(output_label_array == inner)[0]
plt.scatter(X[outer_numbers, 0], X[outer_numbers, 1], color='navy',
            marker='s', lw=0, s=10, label="outer learned")
plt.scatter(X[inner_numbers, 0], X[inner_numbers, 1], color='c',
            marker='s', lw=0, s=10, label="inner learned")
plt.legend(scatterpoints=1, shadow=False, loc='upper right')
plt.title("Labels learned with Label Spreading (KNN)")


plt.subplots_adjust(left=0.07, bottom=0.07, right=0.93, top=0.92)
plt.show()


[Figure: left panel "Raw data (2 classes=outer and inner)"; right panel "Labels learned with Label Spreading (KNN)"]
Sections


• Obtaining the IMDb movie review dataset
• Text feature extraction
  ◦ Bag-of-words model
  ◦ Bigrams and N-Grams
  ◦ Character n-grams
  ◦ Tfidf encoding
• Cleaning text data
• Processing documents into tokens
• Training a logistic regression model for sentiment classification
• Working with bigger data - online algorithms and out-of-core learning
• Model persistence
• word2vec


Obtaining the IMDb movie review dataset 


The data can be downloaded from the link given in the original notebook.


After unpacking the archive, the code below reads the data into a pandas DataFrame.


cd Documents/shanghai python/data/aclImdb/ 
/Users/xiaokai/Documents/shanghai python/data/aclImdb 


ls 


README
[directory listing truncated: a long list of review files named <id>_<rating>.txt, for example 10662_7.txt and 10663_8.txt]
182 10.txt 


1830 8.txt 


1831 10.txt 


1832 f.txt 


1833 10.txt 


1834 8.txt 


1835 10.txt 


1836 /.txt 


183/ 10.txt 


1838 10.txt 


1839 10.txt 


183 8.txt 


1840 10.txt 


1841 /.txt 


40/4 10.txt 


40/5 10.txt 


4076 10.txt 


407/77 10.txt 


40/8 10.txt 


4079 9.txt 


407_10.txt 


4080_10.txt 


4081_10.txt 


4082_10.txt 


4083_10.txt 


4084_7.txt 


4085_10.txt 


4086 /.txt 


408/ 10.txt 


4088 /.txt 


4089 8. txt 


408 10. txt 


4090 8. txt 


4091 /.txt 


6324 8.txt 


6325 /.txt 


6326 10. txt 


632/ 10. txt 


6328 10. txt 


6329 10. txt 


632 10. txt 


6330 10. txt 


6331 10. txt 


6332 8.txt 


6333 8.txt 


6334 8.txt 


6335 10.txt 


6336 /.txt 


633/_7.txt 


6338_9.txt 


6339 7.txt 


633_8.txt 


6340 10.txt 


6341 /.txt 


89/5_10. 


85/6 8.t 


85//_8.T 


8578 8.t 


85/9 8.t 


85/_8.TXxX 


8580 9.t 


8581 10. 


8582 9. 


8583 9.Ｌ 


8584 8.t 


8585 8.t 


8586 8.t 


858/_7.T 


8588 /.t 


8589 /.t 


858 8. tX 


8590 /.t 


8591 /.t 


8592 /.t 


10842 _7.txt 


xt 
10843_7.txt 
xt 

10844 9.txt 
txt 

10845 10.txt 
xt 

10846 9.txt 
xt 

10847 10.txt 
xt 

10848 10.txt 
xt 

10849 10.txt 
t 

1084 9.txt 

t 

10850 10.txt 
xt 

10851 9.txt 
txt 

10852 10.txt 
txt 

10853 10.txt 
txt 
10854_10.txt 
txt 
10855_9.txt 
txt 
10856_8.txt 
txt 
10857_8.txt 
xt 
10858_8.txt 
xt 

10859 7.txt 
txt 

1085 7.txt 


t 


1842 9.txt 


1843 9.txt 


1844 8.txt 


1845 /.txt 


1846 8.txt 


1847 10.txt 


1848 8.txt 


1849 /.txt 


184 8.txt 


1850 10.txt 


1851 10.txt 


1852 9.txt 


1853 8.txt 


1854 10.txt 


1855 9.txt 


1856 9.txt 


185/ 10.txt 


1858 10.txt 


1859 8.txt 


185 9.txt 


4092 10.txt 


4093 8.txt 


4094 9.txt 


4095 8.txt 


4096 10.txt 


4097 8.txt 


4098 10.txt 


4099 8.txt 


409 10.txt 


40 8.txt 


4100 10.txt 


4101 8.txt 


4102 10.txt 


4103 /.txt 


4104 9.txt 


4105 10.txt 


4106 /.txt 


4107 10.txt 


4108 /.txt 


4109 10.txt 


6342 10.txt 


6343 10.txt 


6344 7/.txt 


6345 /.txt 


6346 10.txt 


6347 9.txt 


6348 10.txt 


6349 10.txt 


634 8.txt 


6350 8.txt 


6351 8.txt 


6352 10.txt 


6353 10.txt 


6354 10.txt 


6355 8.txt 


6356 8.txt 


635/ 9.txt 


6358 10.txt 


6359 10.txt 


635 9.txt 


8593 /.t 


8594 /.t 


8595 10. 


8596 /.t 


859/ 9.t 


8098 /.t 


09099957E 


859 9, tx 


835 10. tx 


8600 8.t 


8601 10. 


8602 10. 


8603 10. 


8604 10. 


8605 10. 


8606 10. 


96007 9.t 


8608 9.t 


8609 10. 


860 8. tX 


10860_7.txt 


txt 
10861_7.txt 
xt 
10862_9.txt 
xt 

10863 8.txt 
xt 

10864 8.txt 
xt 

10865 7.txt 
txt 

10866 7.txt 
xt 

10867 7.txt 
xt 

10868 8.txt 
xt 

10869 7.txt 
xt 

1086 7.txt 
t 

10870 8.txt 
xt 

10871 7.txt 
txt 

10872 7.txt 
xt 

10873 8.txt 
xt 

10874 10.txt 
xt 

10875 8.txt 
xt 

10876 7.txt 
xt 

10877 10.txt 
txt 

10878 7.txt 


txt 


1860 9.txt 


1861 10.txt 


1862 8.txt 


1863 10.txt 


1864 10.txt 


1865 8.txt 


1866 8.txt 


186/ 9.txt 


1868 10.txt 


1869 9.txt 


186 8.txt 


18/0 10.txt 


18/1 10.txt 


18/2 l.txt 


18/3 8.txt 


18/4 10.txt 


18/5 10.txt 


18/6 10.txt 


18/7 l.txt 


18/8 10.txt 


410 8.txt 


4110 10.txt 


4111 10.txt 


4112 9.txt 


4113 /.txt 


4114 9.txt 


4115 8.txt 


4116 9.txt 


4117 9.txt 


4118 10.txt 


4119 8.txt 


411 10.txt 


4120 9.txt 


4121 10.txt 


4122 10.txt 


4123 /.txt 


4124 8.txt 


4125 8.txt 


4126 8.txt 


4127 8.txt 


6360 /.txt 


6361 10.txt 


6362 /.txt 


6363 8.txt 


6364 8.txt 


6365 /.txt 


6366 8.txt 


636/ /.txt 


6368 /.txt 


6369 10.txt 


636 10.txt 


63/0 8.txt 


6371 9.txt 


6372 f.txt 


63/3 10.txt 


6374 10.txt 


63/5 8.txt 


6376 10.txt 


63// 9.txt 


63/8 8.txt 


8610 10. 


8611 8.t 


8612 /.t 


8613 /.t 


8614 9.t 


8615 10. 


8616 8.t 


861/ 9.t 


8618 8.t 


8619 /.t 


861 7 ,七 X 


8620 8.t 


8621 10. 


8622 8.t 


8623 /.t 


8624 /.t 


8625 /.t 


8626 /.t 


862/ 10. 


8628 10. 


108/9 10.txt 


xt 

1087 10.txt 
t 

10880 8.txt 
xt 

10881 7.txt 
xt 

10882 8.txt 
xt 

10883 7.txt 
xt 

10884 8.txt 
xt 

10885 7.txt 
xt 

10886 10.txt 
xt 

10887 7.txt 
txt 

10888 8.txt 
xt 

10889 10.txt 
xt 

1088 9.txt 
xt 

10890 9.txt 
txt 

10891 7.txt 
xt 

10892 7.txt 
xt 

10893 8.txt 
xt 

10894 8.txt 
xt 

10895 7.txt 
xt 

10896 8.txt 


xt 


18/9 10.txt 


18/_8.txt 


1880_10.txt 


1881 /.txt 


1882 10.txt 


1883 /.txt 


1884 8.txt 


1885 10.txt 


1886 10.txt 


188/ 8.txt 


1888 8.txt 


1889 10.txt 


188 /.txt 


1890 10.txt 


1891 8.txt 


1892 8.txt 


1893 10.txt 


1894 8.txt 


1895 10.txt 


1896 8.txt 


4128 10.txt 


4129 10.txt 


412 8.txt 


4130 9.txt 


4131 /.txt 


4132 /.txt 


4133 /.txt 


4134 f.txt 


4135 10.txt 


4136 10.txt 


4137 8.txt 


4138 10.txt 


4139 8.txt 


413 10.txt 


4140 10.txt 


4141 10.txt 


4142 10.txt 


4143 9.txt 


4144 10.txt 


4145 10.txt 


63/9 9.txt 


63/_10.txt 


6380 9.txt 


6381 8.txt 


6382 /.txt 


6383 10.txt 


6384 9.txt 


6385 10.txt 


6386 10.txt 


638/ 8.txt 


6388 /.txt 


6389 9.txt 


638 10.txt 


6390 8.txt 


6391 8.txt 


6392 10.txt 


6393 8.txt 


6394 10.txt 


6395 9.txt 


6396 9.txt 


8629 9.T 


862 7.tX 


8630 9.t 


8631 /.t 


8632 9.t 


8633 8.t 


8634 9.t 


8635 /.t 


8636 /.t 


G63/ 10. 


8038 /.t 


8639 9.t 


863 10.t 


8640 10. 


8641 /.t 


8642 8.t 


8643 /.t 


8644 7/.t 


8645_8.t 


8646_7.t 


10897_9.txt 
xt 
10898_7.txt 
xt 

10899 10.txt 
xt 

1089 10.txt 
t 

108 10.txt 
txt 

10900 8.txt 
xt 

10901 10.txt 
xt 

10902 10.txt 
txt 

10903 10.txt 
xt 

10904 10.txt 
xt 

10905 8.txt 
xt 

10906 10.txt 
xt 

10907 9.txt 
xt 

10908 10.txt 
xt 

10909 10.txt 
t 

1090 8.txt 
xt 

10910 10.txt 
xt 

10911 7.txt 
xt 

10912 7.txt 
xt 

10913 7.txt 


xt 


1897/7 10.txt 


1898 9.txt 


1899 7/.txt 


189 9.txt 


18 /.txt 


1900 8.txt 


1901 /.txt 


1902 10.txt 


1903 10.txt 


1904 /.txt 


1905 10.txt 


1906 10.txt 


1907 /.txt 


1908 9.txt 


1909 10.txt 


190 10.txt 


1910 9.txt 


1911 10.txt 


1912 9. CXE 


1913_10.txt 


4146_10.txt 


4147_8.txt 


4148 /.txt 


4149 10.txt 


414 10.txt 


4150 10.txt 


4151 10.txt 


4152 10.txt 


4153 10.txt 


4154 10.txt 


4155 10.txt 


4156 10.txt 


4157 8.txt 


4158 10.txt 


4159 9.txt 


415 /.txt 


4160 9.txt 


4161 8.txt 


4162 10.txt 


4163 9.txt 


6397 8.txt 


6398 8.txt 


6399 7.txt 


639 10.txt 


63 10.txt 


6400 9.txt 


6401 8.txt 


6402 8.txt 


6403 8.txt 


6404 8.txt 


6405 /.txt 


6406 9.txt 


6407/7 10.txt 


6408 10.txt 


6409 7.txt 


640 10.txt 


6410 10.txt 


6411 10.txt 


6412 9.txt 


6413 /.txt 


864/_7.T 


8648 9.T 


8649 9.T 


864 9. tx 


8650 10. 


8651 9.t 


8652 8.t 


8653 10. 


8694 9.t 


8655_9.t 


8656_9.t 


865/_9.T 


8658 8.t 


8659 /.t 


865_9. Tx 


8660 /.t 


8661 8.t 


8662 8.t 


8663 /.t 


8664 8.t 


10914 8.txt 

xt 

10915 9.txt 

xt 

10916 7.txt 

xt 

10917 8.txt 

xt 

10918 8.txt 

xt 

10919 10.txt 
t 

1091 10.txt 

xt 

10920 10.txt 
xt 

10921 10.txt 
xt 

10922 10.txt 
xt 

10923 9.txt 

xt 

10924 10.txt 
xt 

10925 9.txt 

xt 

10926 9.txt 

xt 

10927 10.txt 
xt 

10928 7.txt 

xt 

10929 10.txt 
t 

1092 10.txt 

xt 

10930 8.txt 

xt 

10931 7.txt 


xt 


1914 /.txt 


1915 10.txt 


1916 8.txt 


1917 9.txt 


1918 9.txt 


1919 8.txt 


EOS 606 


1920 10.txt 


1921 CXE 


1922 9.txt 


1923 10.txt 


1924 10.txt 


1925 9.txt 


1926 10.txt 


192/ 8.txt 


1928 10.txt 


1929 10.txt 


92 9 EXEt 


1930_10.txt 


1931_8.txt 


4164 10.txt 


4165 8.txt 


4166 9.txt 


4167 10.txt 


4168 /.txt 


4169 /.txt 


416 8.txt 


41/0 8.txt 


41/1 /.txt 


41/2 10.txt 


41/3 10.txt 


41/4 9.txt 


41/5 8.txt 


41/6 10.txt 


41/7 9.txt 


41/8 10.txt 


41/9 10.txt 


417_7.txt 


4180_10.txt 


4181_9.txt 


6414_8.txt 


6415_8.txt 


6416 /.txt 


6417 /.txt 


6418 10.txt 


6419 10.txt 


641 8.txt 


6420 7.txt 


6421 8.txt 


6422 /7.txt 


6423 7.txt 


6424 10.txt 


6425 9.txt 


6426 8.txt 


6427 10.txt 


6428 8.txt 


6429 7.txt 


642 10.txt 


6430 10.txt 


6431 8.txt 


8665 8.t 


8666 /.t 


866/_7.t 


8668_9.t 


8669 /.t 


866 8. tX 


86/0 8.t 


86/1 /.t 


86/2 8.t 


86/3 /.t 


86/4 9.t 


86/5 8.t 


86/6 8.t 


86/7/_9.t 


86/8_9.t 


86/9 8.t 


86/ 8.tX 


8680 9.t 


8681 8.t 


8682 /.t 


10932 7.txt 
xt 


10933 10.txt 


txt 

10934 10.txt 
xt 

10935 7.txt 
xt 

10936 8.txt 
xt 

10937 9.txt 
xt 

10938 10.txt 
xt 

10939 8.txt 
t 

1093 8.txt 
xt 

10940 10.txt 
xt 

10941 7.txt 
txt 

10942 9.txt 
txt 

10943 10.txt 
txt 

10944 8.txt 
xt 

10945 9.txt 
txt 

10946 7.txt 
xt 

10947 8.txt 
txt 
10948_10.txt 
txt 

10949 8.txt 
t 

1094 9.txt 


t 


1932 10.txt 


1933 8.txt 


1934 9.txt 


1935 10.txt 


1936 10.txt 


193/ 10.txt 


1938 9.txt 


1939 8.txt 


193 7/.txt 


1940 10.txt 


1941 9.txt 


1942 10.txt 


1943 10.txt 


1944 8.txt 


1945 8.txt 


1946 9.txt 


1947 8.txt 


1948 10.txt 


1949 8.txt 


194 8.txt 


4182 10.txt 


4183 10.txt 


4184 8.txt 


4185 8.txt 


4186 /.txt 


4187 /.txt 


4188 8.txt 


4189 8.txt 


418 9.txt 


4190 10.txt 


4191 10.txt 


4192 10.txt 


4193 8.txt 


4194 10.txt 


4195 9.txt 


4196 9.txt 


4197 8.txt 


4198 /.txt 


4199 /.txt 


419 7.txt 


6432 9.txt 


6433 7.txt 


6434 /.txt 


6435 /.txt 


6436 10.txt 


643/ 9.txt 


6438 10.txt 


6439 10.txt 


643 10.txt 


6440 8.txt 


6441 9.txt 


6442 9.txt 


6443 9.txt 


6444 8.txt 


6445 10.txt 


6446 10.txt 


6447 8.txt 


6448 10.txt 


6449 10.txt 


644 9.txt 


8683 9.t 


8684 10. 


8685 9.t 


8686 8.t 


868/_9.T 


8688 8.t 


8689 /.t 


868_8. Tx 


8690 8.t 


8691 /.t 


8692 10. 


8693 10. 


8694 10. 


8695 9.t 


8696 10. 


869/_7.t 


8698 10. 


8699 10. 


869 7/.tX 


86 10.tx 


10950_9.txt 
xt 
10951 8.txt 
xt 
10952 10.txt 
txt 
10953 10.txt 
txt 


10954 10.txt 
xt 


10955 7.txt 
xt 

10956 9.txt 
xt 

10957 10.txt 
txt 

10958 10.txt 
xt 

10959 10.txt 
xt 

1095 9.txt 
xt 

10960 8.txt 
xt 

10961 9.txt 
xt 

10962 10.txt 
xt 

10963 7.txt 
txt 

10964 7.txt 
txt 

10965 9.txt 
txt 

10966 9.txt 
txt 

10967 10.txt 
xt 

10968 7.txt 


xt 


1950 8.txt 


T9519 CXE 


1952 8.txt 


1953 10.txt 


1954 10.txt 


T955 90 EXE 


1956 8. txt 


1957 <. 


1958 10. txt 


1959 f.txt 


195 8. txt 


1960 /.txt 


1961 9. txt 


1962 10. txt 


1963 8. txt 


1964 /.txt 


1965 /.txt 


1966 10.txt 


196/ 8.txt 


1968 8.txt 


41 9.txt 


4200 9.txt 


4201 9.txt 


4202 10.txt 


4203 8.txt 


4204 10.txt 


4205 8.txt 


4206 9.txt 


4207 f.txt 


4208 10.txt 


4209 7.txt 


420 7/.txt 


4210 9.txt 


4211 8.txt 


4212 10.txt 


4213 /.txt 


4214 10.txt 


4215 9.txt 


4216 10.txt 


4217 9.txt 


6450 10.txt 


6451 8.txt 


6452 9.txt 


6453 10.txt 


6454 9.txt 


6455 /.txt 


6456 10.txt 


6457 f.txt 


6458 8.txt 


6459 10.txt 


645 10.txt 


6460 10.txt 


6461 10.txt 


6462 10.txt 


6463 8.txt 


6464 9.txt 


6465 9.txt 


6466 10.txt 


6467 l.txt 


6468 /.txt 


8/00 /.t 


8/01 8.t 


3/02 10. 


8/03 10. 


8/04 9.t 


8/05 8.t 


8/06 8.t 


90/07 10. 


8/08_8.t 


8/09 8.t 


8/0 10.t 


8/10 /.t 


801. 


87/12 8.t 


9/13 10. 


9/14 10. 


9715 10. 


9/16 10. 


8/17_9.t 


8/18_9.t 


10969 10.txt 
txt 

1096 9.txt 

t 

10970 7.txt 
xt 

10971 9.txt 
txt 

10972 10.txt 
xt 


10973 8.txt 

xt 

10974 10.txt 
txt 

10975 10.txt 
txt 

10976 10.txt 
xt 

10977 8.txt 

xt 

10978 10.txt 
xt 

10979 9.txt 

txt 

1097 9.txt 

t 

10980 7.txt 
xt 

10981 8.txt 
xt 

10982 10.txt 
xt 

10983 10.txt 
txt 

10984 10.txt 
xt 

10985 9.txt 
xt 


10986 10.txt 
txt 


1969 10.txt 


196 9.txt 


1970 9.txt 


19/1 9.txt 


19/2 10.txt 


1973 8.txt 


19/4 8.txt 


197/75 /.txt 


1976 10.txt 


19/17 10.txt 


1978 9.txt 


1979 9.txt 


197 9.txt 


1980 10.txt 


1981 9.txt 


1982 10.txt 


1983 10.txt 


1984 10.txt 


1985 10.txt 


1986 10.txt 


4218 8.txt 


4219 /.txt 


427 ENF EXE 


4220 f.txt 


4221 8.txt 


4222 /.txtT 


4223 8.txt 


4224 10.txt 


4225 10.txt 


4226 10.txt 


4227 10.txt 


4228 10.txt 


4229 10.txt 


422 7/.txt 


4230 10.txt 


4231 /.txt 


4232 f.txt 


4233 7.txt 


4234 f.txt 


4235 f.txt 


6469 9.txt 


646 9.txt 


6470 10.txt 


6471 10.txt 


6472 10.txt 


6473 8.txt 


64/4 8.txt 


6475 10.txt 


6476 10.txt 


6477_7.txt 


6478 10.txt 


6479 10.txt 


647 10.txt 


6480 10.txt 


6481 10.txt 


6482 10.txt 


6483 10.txt 


6484 10.txt 


6485 10.txt 


6486 10.txt 


8/719 10. 


8/1 9.tx 


8/20 9.t 


8/21 10. 


90122 94% 


8/23 8.t 


8/24 10. 


3/25 10. 


8/26 9.t 


8/2/_7.t 


8/28_7.t 


8/29 10. 


8/2 9.tX 


8/30_8.t 


oT EC 000 


8/32 9.t 


8/33_10. 


8/34_9.t 


8/35_8.t 


8/36_10. 


10987_8.txt 
xt 

10988 10.txt 
xt 

10989 9.txt 
xt 

1098 9.txt 

t 

10990 7.txt 
xt 

10991 9.txt 
txt 

10992 10.txt 
txt 

10993 10.txt 
xt 

10994 8.txt 
xt 

10995 10.txt 
xt 

10996 8.txt 
txt 

10997 10.txt 
xt 

10998 7.txt 
xt 

10999 7.txt 
xt 

1099 10.txt 
t 

109 10.txt 
xt 

10 9.txt 

xt 

11000 10.txt 
xt 

11001 10.txt 
xt 

11002 8.txt 


xt 


198/ 10.txt 


1988 9.txt 


1989 8.txt 


198 8.txt 


1990 10.txt 


1991 10.txt 


1992 10.txt 


1993 10.txt 


1994 10.txt 


1995 9.txt 


1996 10.txt 


1997 9.txt 


1998 9.txt 


1999 9.txt 


199 10.txt 


19 10.txt 


ITED CE 


2000_10.txt 


2001 9.txt 


2002 7/.txt 


4236 8.txt 


4237 10.txt 


4238 9.txt 


4239 10.txt 


423 10.txt 


4240 9.txt 


4241 10.txt 


4242 9.txt 


4243 8.txt 


4244 10.txt 


4245 f.txt 


4246 8.txt 


4247 10.txt 


4248 10.txt 


4249 10.txt 


424 8.txt 


4250 8.txt 


4251 9.txt 


4252 9.txt 


4253 10.txt 


648/ 10.txt 


6488 10.txt 


6489 10.txt 


648 7/.txt 


6490 10.txt 


6491 10.txt 


6492 10.txt 


6493 10.txt 


6494 10.txt 


6495 10.txt 


6496 10.txt 


6497 10.txt 


6498 10.txt 


6499 10.txt 


649 10.txt 


64 7.txt 


6500 10.txt 


6501 10.txt 


6502 10.txt 


6503 10.txt 


8/3/_9.T 


8/38 8.t 


8/39 9.t 


8/3_8.TtX 


8/40_8.t 


8/41 10. 


8/42_10. 


8/43_7.t 


8/44_7.t 


8/45_7.t 


8/46_10. 


8/47_7.t 


8/48_8.t 


8/49 7.t 


8/4 9.tx 


8/50 /.t 


OT d E 


OT o2 9 t 


8/03 8.t 


8/54 8.t 


11003_10.txt 


xt 
11004_8.txt 
xt 

11005 10.txt 
xt 

11006 7.txt 
txt 

11007 10.txt 
xt 

11008 9.txt 
xt 

11009 7.txt 
xt 

1100 7.txt 
txt 

11010 10.txt 
xt 

11011 10.txt 
xt 

11012 9.txt 
xt 

11013 7.txt 
txt 

11014 10.txt 
xt 

11015 9.txt 
txt 

11016 7.txt 
xt 

11017 9.txt 
xt 

11018 10.txt 
t 

11019 10.txt 
xt 

1101 8.txt 
xt 

11020 8.txt 


xt 


2003 8.txt 


2004 10.txt 


2005 10.txt 


2006 /.txt 


200/ 7.txt 


2008 /.txt 


2009 10.txt 


200 10.txt 


2010 8.txt 


2011 /.txt 


2012 8.txt 


2013 8.txt 


2014 /.txt 


2015 8.txt 


2016 /.txt 


201/ 10.txt 


2018 9.txt 


2019 10.txt 


201 10.txt 


2020 7/.txt 


4254 9.txt 


4255 9.txt 


4256 8.txt 


4257 10.txt 


4258 9.txt 


4259 9.txt 


425 10.txt 


4260 9.txt 


4261 9.txt 


4262 10.txt 


4263 8.txt 


4264 10.txt 


4265 8.txt 


4266 7.txt 


426/ 8.txt 


4268 10.txt 


4269 10.txt 


426 /.txt 


42/0 10.txt 


42/1 10.txt 


6504 10.txt 


6505 10.txt 


6506 10.txt 


650/ 10.txt 


6508 /.txt 


6509 9.txt 


650 9.txt 


6510 7/.txt 


6511 10.txt 


6512 /.txt 


6513 9.txt 


6514 10.txt 


6515 /.txt 


6516 10.txt 


651/ 10.txt 


6518 10.txt 


6519 8.txt 


651 10.txt 


6520 7.txt 


6521 /.txt 


9759S FE 


8/56_9.t 


8/5/_8.T 


8/758 10. 


8/59 8.t 


8/5_10.t 


8/60_7.t 


9/61 10. 


8/62 8.t 


8/63 9.t 


8/64 8.t 


3/65 10. 


8/66 8.t 


3/67 10. 


8/68 8.t 


8/69 /.t 


9/6 7 tX 


8//0_7.T 


090. 


8//2_8.T 


11021_8.txt 


xt 
11022_9.txt 
xt 

11023 9.txt 
xt 

11024 9.txt 
txt 

11025 7.txt 
txt 

11026 7.txt 
xt 

11027 10.txt 
xt 

11028 8.txt 
t 

11029 7.txt 
txt 

1102 8.txt 
xt 

11030 8.txt 
xt 

11031 8.txt 
xt 

11032 txt 
xt 

11033 9.txt 
txt 

11034 9.txt 
xt 

11035 8.txt 
xt 

11036 8.txt 
txt 

11037 8.txt 
txt 

11038 10.txt 
t 

11039 8.txt 


xt 


2021 8. 


2022 9. 


2023 f. 


2024 9. 


2025 10.txt 


2026 8. 


2027 10.txt 


2028 10.txt 


2029 8. 


202 10. 


2030 9. 


2031 8. 


2032 10.txt 


2033 8. 


2034 9. 


2035 7. 


2036 7. 


203/ 8. 


2038 /. 


2039 9. 


txt 


txt 


txt 


txt 


txt 


txt 


txt 


txt 
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98/8_7.t 


98/9_8.t 


98/_8.tx 


9880_10. 


9881_8.t 


9882_8.t 


9883_8.t 


9884_10. 


9885 10. 


9886 10. 


988/ 9.t 


9888 10. 


9889 9.t 


988 8.tx 


9890 8.t 


9891 10. 


9892 8.t 


xt 


12141 8.txt 
xt 

12142 9.txt 
xt 

12143 8.txt 
xt 

12144 8.txt 
xt 

12145 10.txt 
txt 

12146 7.txt 
txt 

12147 10.txt 
xt 

12148 10.txt 
t 

12149 10.txt 
t 

1214 10.txt 
txt 

12150 7.txt 
xt 

12151 10.txt 
xt 

12152 9.txt 
xt 

12153 10.txt 
xt 

12154 7.txt 
txt 
1215527: Ext 
txt 

12156 8.txt 
xt 

12157 9.txt 
xt 
12158 7. txt 
xt 


12159 /.txt 


3141 10. 


txt 


3142 8.txt 


3143 10. 


3144 10. 


3145 10. 


3146 10. 


3147 10. 


txt 


txt 


txt 


txt 


txt 


3148 9.txt 


3149 8.txt 


314 10.txt 


3150 10. 


txt 


3151 9.txt 


S152279 EXE 


3153 10. 


3154_10. 


3155_10. 


txt 


txt 


txt 


Sio ene 


315/_10. 


txt 


3158_8.txt 


3159 9.txt 


2392 10.txt 


o393 10.txt 


o394 10.txt 


o395 10.txt 


o396 9. 


So eli She 


9398 8. 


2399 8. 


o39 10. 


txt 


txt 


txt 


txt 


txt 


o3 10.txt 


o400 10.txt 


0401 9. 


50402 9. 


o403 10.txt 


o404 10.txt 


o405 10.txt 


2406 /. 


9407_7. 


2408 7. 


o409 10.txt 


txt 


txt 


txt 


txt 


txt 


1042 9. 


1043 f. 


1044 f. 


1045 f. 


1046 8. 


1047 f. 


1048 f. 


7649 9. 


764 10. 


1650 9. 


1651 8. 


7652 8. 


7653 10 


7654 310. 


/655_10. 


/656_10. 


/657_10. 


/658_10. 


/659_10 


txt 


txt 


txt 


txt 


txt 


txt 


txt 


txt 


txt 


txt 


txt 


txt 


. txt 


txt 


txt 


txt 


txt 


txt 


txt 


/65 9.txt 


9893 7.t 


9894 8.t 


9895 8.t 


9896 8.t 


989/ 10. 


9898 10. 


SOSOE E 


989 9.tx 


98 10.tx 


9900 10. 


9901 8.t 


9902 7.t 


Ss 


9904 8.t 


9905 10. 


9906 10. 


9907 7.t 


9908 8.t 


9909 7.t 


990 9.tx 


t 


1215_10.txt 
txt 
12160_10.txt 
txt 
12161_7.txt 
txt 
12162_8.txt 
txt 
12163_8.txt 
xt 

12164 9.txt 
xt 

12165 10.txt 
xt 

12166 10.txt 
txt 

12167 10.txt 
txt 

12168 10.txt 
xt 

12169 7.txt 
t 

1216 10.txt 
xt 

12170 8.txt 
xt 

12171 7.txt 
txt 

12172 7.txt 
xt 

12173 8.txt 
xt 

12174 7.txt 
xt 
121775-7.toxt 
xt 

12176 8.txt 
xt 


1217/7 f.txt 


315 10.txt 


3160 7/.txt 


3161 10.txt 


3162 8.txt 


3163 10.txt 


3164 10.txt 


3165 10.txt 


3166 9.txt 


3167 /.txt 


3168 /.txt 


3169 8.txt 


316 10.txt 


31/0 9.txt 


31/1 9.txt 


3142 00. 0 


3173_8.txt 


3174_9.txt 


31/5_8.txt 


31/6 9.txt 


31/7 9.txt 


o40 8.txt 


o410 /.txt 


o411 10.txt 


0412 9.txt 


o413 10.txt 


o414 10.txt 


o415 10.txt 


o416 9.txt 


o41/ 10.txt 


o418 10.txt 


5419 10.txt 


941 7.txt 


0420 /.txt 


0421 /.txt 


0422 9.txt 


0423 9.txt 


0424 10.txt 


9425 10.txt 


o426 10.txt 


5427 10.txt 


/660 10.txt 


/661 10.txt 


/662 10.txt 


/663 7.txt 


/664 10.txt 


/665 7.txt 


/666 7.txt 


/667/_7.txt 


/668 10.txt 


/669 8.txt 


/66 10.txt 


7670 8.txt 


(6/1 10.txt 


7672 10.txt 


(6/93 f.txt 


(6/4 10.txt 


(6/5 8.txt 


7676 9.txt 


/67/7_10.txt 


/67/8_10,txt 


9910_10. 


9911 10. 


chet LA ARG) 


9913 10. 


9914 7.t 


SD (o E 


9916_7.t 


9917_10. 


9918_10. 


SMS) E 


991 7.tx 


9920 7.t 


SZ OE 


9922 10. 


SOS Mole 


9924 9.t 


SLI 


9926_7.t 


992 OE 


9928_7.t 


xt 


12178 7.txt 
xt 

12179 8.txt 
t 

1217 10.txt 
xt 

12180 10.txt 
xt 

12181 8.txt 
xt 

12182 8.txt 
xt 

12183 10.txt 
xt 

12184 10.txt 
xt 

12185 9.txt 
txt 

12186 9.txt 
txt 

12187 9.txt 
xt 

12188 10.txt 
txt 

12189 7.txt 
t 

1218 8.txt 
xt 

12190 7.txt 
xt 

12191 7.txt 
xt 

12192 7.txt 
xt 


12193 10.txt 


xt 


12194 10.txt 


xt 
12195 8.txt 


31/8 8.txt 


31/9 8.txt 


31/ 10.txt 


3180 8.txt 


3181 10.txt 


3182 8.txt 


3183 9.txt 


3184 8.txt 


3185 10.txt 


3186 8.txt 


3187 /.txt 


3188 8.txt 


3189 8.txt 


318 10.txt 


3190 7/.txt 


3191 8.txt 


3192 10.txt 


3193 9.txt 


3194 9.txt 


3195 10. EXE 


o428 10.txt 


9429 10.txt 


942 9.txt 


5430 10.txt 


9431_10.txt 


9432_7/.txt 


9433_10.txt 


9434_9.txt 


9435_10.txt 


9436_7/.txt 


943/_10.txt 


o438 10.txt 


9439_10.txt 


543 10.txt 


o440 10.txt 


o441 10.txt 


0442 8.txt 


o443 8.txt 


o444 8.txt 


o445 10.txt 


16/9 10.txt 


(67 9.txt 


1680 8.txt 


1681 9.txt 


/682_8.txt 


/683 10.txt 


/684_7.txt 


/685 10.txt 


/686 10.txt 


/68/ 8.txt 


/688 10.txt 


/689 10.txt 


/68 9.txt 


1690 9.txt 


/691 9.txt 


1692 7.txt 


7693 9.txt 


1694 10.txt 


/695 9.txt 


/696 8.txt 


SAS at 


992 /.tx 


9930 8.t 


993 Somi 


9932_8.t 


9933_8.t 


9934_8.t 


FOSSE 


9936_10. 


993/_10. 


99368 9. 


9939 10. 


993 8.tx 


9940 9.t 


9941 /.t 


9942 7.t 


9943 7.t 


9944 9.t 


9945 8.t 


9946 9.t 


xt 
12196 10.txt 


txt 

12197 9.txt 
xt 

12198 10.txt 
xt 

12199 9.txt 
t 

1219 10.txt 
xt 

121 10.txt 
xt 

12200 8.txt 
xt 

12201 8.txt 
txt 

12202 9.txt 
xt 

12203 10.txt 
xt 

12204 8.txt 
xt 

12205 8.txt 
xt 

12206 9.txt 
xt 

12207 8.txt 
xt 

12208 8.txt 
t 

12209 7.txt 
xt 

1220 9.txt 
xt 

12210 9.txt 
xt 

12211 10.txt 
txt 


12212 10.txt 


3196 10.txt 


319/_10.txt 


3198_10.txt 


3199_10.txt 


319_9.txt 


31_8.txt 


3200_10.txt 


3201 10.txt 


3202 10.txt 


3203 10.txt 


3204 10.txt 


3205 8.txt 


3206 8.txt 


320/ 7.txt 


3208 /.txt 


3209 8.txt 


320 8.txt 


3210 8.txt 


3211 /.txt 


3212 /.txt 


o446 9.txt 


o447 10.txt 


o448 10.txt 


o449 8.txt 


o44 8.txt 


o450 10.txt 


o451 8.txt 


0452 8.txt 


9453 8.txt 


o454 /.txt 


o455 /.txt 


o456 10.txt 


o45/ 8.txt 


o458 10.txt 


9459 8.txt 


o45 10.txt 


o460 8.txt 


2461 /.txt 


0462 8.txt 


o463 9.txt 


1697 9.txt 


/698 10.txt 


/699 10.txt 


/69 8.txt 


TO f.txt 


/700_10,txt 


/701_10.txt 


/702_10.txt 


7703 8.txt 


/704_10.txt 


/705_7.txt 


/706_7.txt 


/7107_9.txt 


/708_10,txt 


/709_7.txt 


/70_10.txt 


77310 8.txt 


77311 10.txt 


f(12 10.txt 


/713_10.txt 


9947_10. 


9948_8.t 


9949 9.t 


994 7.tx 


9950 8.t 


Siem fut 


9952 8.t 


SS ts Sy ING) 


9954 8.t 


LO E 


SSmo Sra E 


995. IE 


SO SN 


LS fult 


995 Sh TEX 


9960 7.t 


996 TEO St 


9962 9.t 


9963_10. 


9964_10. 


txt 


12213_10.txt 


txt 

12214 10.txt 
txt 
12215_10.txt 
txt 
12216_8.txt 
xt 

12217 10.txt 
txt 

12218 7.txt 
t 

12219 10.txt 
txt 

122] 060 
txt 

12220 8.txt 
txt 

12221 8.txt 
xt 

12222 10.txt 
xt 

12223 10.txt 
txt 

12224 9.txt 
xt 

12225 7.txt 
xt 

12226 8.txt 
xt 

12227 T7.txt 
xt 

12228 8.txt 
t 

12229 9.txt 
xt 

1222 10.txt 
xt 


12230 9.txt 


3213 10.txt 


3214 10.txt 


3215 10.txt 


3216 8.txt 


321/ 8.txt 


3218 9.txt 


3219 10.txt 


321 10.txt 


3220 10.txt 


3221 10.txt 


3222 l.tixt 


3223 Ə- CXT 


3224 f.txt 


3225 10. txt 


3226 10. txt 


3227 f.txt 


3228 /.txt 


3229 /.txt 


322 109. CXT 


3230 8.txt 


o464 /.txt 


o465 8.txt 


o466 8.txt 


o46/ 9.txt 


o468 /.txt 


2469 7/.txt 


o46 10.txt 


o4/0 10.txt 


94/1_8.txt 


94/2_7.txt 


94/3_8.txt 


94/4_9.txt 


94/5_8.txt 


94/6_8.txt 


947/7/_7.txt 


o4/8 9.txt 


o4/9 10.txt 


94/_10.txt 


o480 10.txt 


o481 10.txt 


/7/14_10.txt 


/715_8.txt 


/7116_9.txt 


/7117_10.txt 


/7/18_10.txt 


77 L929 OME 


(/711_7.txt 


((20 8.txt 


(121 7.txt 


1122_10.TXxXt 


/7123_10.txt 


7724 10.txt 


/7125_10.txt 


1126 10.txt 


1121 9.txt 


1128_1.TXT 


1729 7.txt 


/712_10.txt 


/730_7.txt 


/7131 9.txt 


9965_10. 


9966_10. 


996/_10. 


9968_9.t 


9969 10. 


996 9.tx 


9970 10. 


9971] 10. 


9972601 


Serio E 


9974 8.t 


9975 10. 


9976 7.t 


She VEIA 


9978 8.t 


SIS S) WE 


997 7.tx 


9980 8.t 


9981 /.t 


9982 9.t 


xt 
12231 10.txt 


xt 

12232 8.txt 
xt 

12233 8.txt 
xt 

12234 8.txt 
xt 

12235 7.txt 
xt 

12236 10.txt 
xt 

12237 10.txt 
xt 

12238 8.txt 
t 

12239 10.txt 
xt 

1223 7.txt 
txt 

12240 8.txt 
txt 

12241 10.txt 
txt 

12242 10.txt 
txt 

12243 8.txt 
txt 

12244 7.txt 
xt 

12245 10.txt 
xt 

12246 9.txt 
xt 

12247 8.txt 
xt 

12248 7.txt 
xt 


12249 8.txt 


3231 10.txt 


3232 9. 


3233_10.txt 


3234 9.txt 


3235 9.txt 


3236 /.txt 


323/ 8.txt 


3238 9. txt 


3239 10. txt 


323 10. txt 


3240 10. txt 


3241 8. txt 


3242 8. txt 


3243 8. txt 


3244 10. txt 


3245 10. txt 


3246 9. txt 


3247 10. txt 


3248 10. txt 


3249 9. txt 


5482 10. 


9483 10. 


9484 10. 


94805 10. 


9486 10. 


o48/ 10. 


o488 10. 


o489 10. 


txt 


txt 


txt 


txt 


txt 


txt 


txt 


txt 


o48 /.txt 


o490 9.txt 


0491 10. 


0492 10. 


0493 10. 


o494 10. 


txt 


txt 


txt 


txt 


o495 8.txt 


o496 9.txt 


5497 9.txt 


o498 /.txt 


5499 10. 


txt 


549 9.txt 


((32 8.txt 


WS SEO 


7734_10.txt 


/735_10.txt 


7736 10.txt 


((3/ 10.txt 


7/38 8.txt 


1139 10.txt 


((3 ftxt 


/740_10.txt 


7741 9 EX 


/7142_8.txt 


/743_8.txt 


/744_10.txt 


/745_8.txt 


/746_10.txt 


/7147_10.txt 


/748_9.txt 


/749 8.txt 


/714_8.tXt 


9983_7.t 


9984_9.t 


9985_9.t 


9986_9.t 


998/_9.t 


9988_8.t 


9989 9.t 


998 /.tx 


9990 8.t 


9991 10. 


9992 10. 


9993 10. 


9994 10. 


22 - di) - 


9996 9.t 


She Yl 


SS a E 


Sa E 


999 10.t 


99_8.txt 


1224 9.txt 324 8.txt o4 10.txt /750_8.txt 9_7.txt 
cat 0_9.txt


Bromwell High is a cartoon comedy. It ran at the same time as some
other programs about school life, such as "Teachers". My 35 years in
the teaching profession lead me to believe that Bromwell High's satire
is much closer to reality than is "Teachers". The scramble to survive
financially, the insightful students who can see right through their
pathetic teachers' pomp, the pettiness of the whole situation, all
remind me of the schools I knew and their students. When I saw the
episode in which a student repeatedly tried to burn down the school, I
immediately recalled ......... at .......... High. A classic line:
INSPECTOR: I'm here to sack one of your teachers. STUDENT: Welcome to
Bromwell High. I expect that many adults of my age think that Bromwell
High is far fetched. What a pity that it isn't!


import os

import pandas as pd
import pyprind

# Map the sentiment sub-directory names to integer class labels.
labels = {'pos': 1, 'neg': 0}
pbar = pyprind.ProgBar(50000)  # one progress-bar tick per review file
rows = []
for s in ('test', 'train'):
    for l in ('pos', 'neg'):
        path = 'data/aclImdb/%s/%s' % (s, l)
        for file in os.listdir(path):
            with open(os.path.join(path, file), 'r', encoding='utf-8') as infile:
                txt = infile.read()
            # Collect rows in a list; DataFrame.append was removed in pandas 2.0.
            rows.append([txt, labels[l]])
            pbar.update()
df = pd.DataFrame(rows, columns=['review', 'sentiment'])


0%                          100%
[##############################] | ETA: 00:00:00
Total time elapsed: 00:01:41


df.head(3) 
                                              review  sentiment
0  I went and saw this movie last night after bei...          1
1  Actor turned director Bill Paxton follows up h...          1
2  As a recreational golfer with some knowledge o...          1


Shuffling the DataFrame: 


import numpy as np 
np.random.seed(0) 
df = df.reindex(np.random.permutation(df.index)) 


df.head(3) 
                                                  review  sentiment
11841  In 1974, the teenager Martha Moxley (Maggie Gr...          1
19602  OK... so... I really like Kris Kristofferson a...          0
45519  ***SPOILER*** Do not read this, if you think a...          0


Optional: Saving the assembled data as a CSV file:


df.to_csv('./data/movie_data.csv', index=False, encoding='utf-8')


import pandas as pd 
df = pd.read_csv('./data/movie_data.csv', encoding='utf-8')
df.head(3) 


                                              review  sentiment
0  In 1974, the teenager Martha Moxley (Maggie Gr...          1
1  OK... so... I really like Kris Kristofferson a...          0
2  ***SPOILER*** Do not read this, if you think a...          0


Text feature extraction 




Bag-of-words model 




Free text of variable length is very far from the fixed-length numeric
representation that we need for machine learning with scikit-learn.

However, there is an easy and effective way to go from text data to a numeric
representation: the so-called bag-of-words model, which provides a data
structure that is compatible with the machine learning algorithms in scikit-learn.


import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer()
docs = np.array([
    'The sun is shining',
    'The weather is sweet',
    'The sun is shining, the weather is sweet, and one and one is two'])
# Learn the vocabulary and build the sparse document-term count matrix.
bag = count.fit_transform(docs)


count.vocabulary_

{u'and': 0,
 u'is': 1,
 u'one': 2,
 u'shining': 3,
 u'sun': 4,
 u'sweet': 5,
 u'the': 6,
 u'two': 7,
 u'weather': 8}

count.get_feature_names()  # scikit-learn >= 1.0: count.get_feature_names_out()

[u'and',
 u'is',
 u'one',
 u'shining',
 u'sun',
 u'sweet',
 u'the',
 u'two',
 u'weather']


As we can see from executing the preceding commands, the vocabulary is stored
in a Python dictionary that maps each unique word to an integer index. Each
index identifies one column of the feature vectors. Next, let us print the
feature vectors that we just created:


bag.toarray()


array([[0, 1, 0, 1, 1, 0, 1, 0, 0],
       [0, 1, 0, 0, 0, 1, 1, 0, 1],
       [2, 3, 2, 1, 1, 1, 2, 1, 1]])


count.inverse_transform(bag)


[array([u'the', u'sun', u'is', u'shining'],
       dtype='<U7'),
 array([u'the', u'is', u'weather', u'sweet'],
       dtype='<U7'),
 array([u'the', u'sun', u'is', u'shining', u'weather', u'sweet', u'and',
        u'one', u'two'],
       dtype='<U7')]
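
To make the counting explicit, here is a minimal pure-Python sketch of the same bag-of-words idea. It is not scikit-learn's implementation: the helper names `tokenize` and `to_count_vector` are hypothetical, and the tokenizer only imitates CountVectorizer's default behaviour (lowercasing, tokens of two or more word characters).

```python
import re
from collections import Counter

docs = [
    'The sun is shining',
    'The weather is sweet',
    'The sun is shining, the weather is sweet, and one and one is two',
]


def tokenize(text):
    # Lowercase the text and keep runs of two or more word characters
    # (an imitation of CountVectorizer's default token pattern).
    return re.findall(r"\b\w\w+\b", text.lower())


# Vocabulary: every distinct token, sorted alphabetically, mapped to a column index.
all_tokens = sorted({token for doc in docs for token in tokenize(doc)})
vocabulary = {word: idx for idx, word in enumerate(all_tokens)}


def to_count_vector(text):
    # One count per vocabulary word, in column order; absent words count as 0.
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]


bag = [to_count_vector(doc) for doc in docs]
for row in bag:
    print(row)
```

Under these assumptions the three printed rows should match the count matrix returned by bag.toarray() above: each column is one vocabulary word, and each entry is how often that word occurs in the document.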


Bigrams and N-Grams 




In the last section, we used the so-called 1-gram (unigram) tokenization: each token
represents a single element with regard to the splitting criterion.


Entirely discarding word order is not always a good idea, as composite phrases 
often have specific meaning, and modifiers like "not" can invert the meaning of 
words. 


A simple way to include some word order are n-grams, which don't only look at a 
single token, but at all pairs of neighborhing tokens. For example, in 2-gram 
(bigram) tokenization, we would group words together with an overlap of one 
word; in 3-gram (trigram) splits we would create an overlap two words, and so 
forth: 


• Original text: "this is how you get ants"
• 1-gram: "this", "is", "how", "you", "get", "ants"
• 2-gram: "this is", "is how", "how you", "you get", "get ants"
• 3-gram: "this is how", "is how you", "how you get", "you get ants"
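
The splits listed above can be reproduced with a few lines of Python. This is only an illustrative sketch; the helper name word_ngrams is hypothetical and whitespace splitting is an assumed tokenization rule.

```python
def word_ngrams(text, n):
    """Return the list of word n-grams of `text`, split on whitespace."""
    tokens = text.split()
    # Slide a window of n tokens over the sequence, one token at a time.
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "this is how you get ants"
for n in (1, 2, 3):
    print(n, word_ngrams(sentence, n))
```

A text with k tokens yields k - n + 1 n-grams, so longer n-grams give fewer but more specific features.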


Which "n" we choose for "n-gram" tokenization to obtain the optimal performance
in our predictive model depends on the learning algorithm, dataset, and task. In
other words, we have to consider "n" in "n-grams" as a tuning parameter.


The CountVectorizer class in scikit-learn allows us to use different n-gram models 
via its ngram_range parameter. While a 1-gram representation is used by default, 
we could switch to a 2-gram representation by initializing a new CountVectorizer 
instance with ngram_range=(2,2) 


bigram_vectorizer = CountVectorizer(ngram_range=(2,2)) 
bigram_vectorizer.fit_transform(docs).toarray() 


bigram_vectorizer.vocabulary_


{u'and one': 0,
 u'is shining': 1,
 u'is sweet': 2,
 u'is two': 3,
 u'one and': 4,
 u'one is': 5,
 u'shining the': 6,
 u'sun is': 7,
 u'sweet and': 8,
 u'the sun': 9,
 u'the weather': 10,
 u'weather is': 11}


Character n-grams 




Sometimes it is also helpful not only to look at words, but to consider single 
characters instead. 

That is particularly useful if we have very noisy data and want to identify the 
language, or if we want to predict something about a single word. We can simply 
look at characters instead of words by setting analyzer="char" . 


X = ['Some say the world will end in fire,', 'Some say in ice.']


char_vectorizer = CountVectorizer(analyzer="char")
char_vectorizer.fit(X)


CountVectorizer(analyzer='char', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)


print(char_vectorizer.get_feature_names())


Tfidf encoding 




np.set_printoptions(precision=2)


When we are analyzing text data, we often encounter words that occur across
multiple documents from both classes. Those frequently occurring words typically
don't contain useful or discriminatory information. In this subsection, we will learn
about a useful technique called term frequency-inverse document frequency
(tf-idf) that can be used to downweight those frequently occurring words in the
feature vectors. The tf-idf can be defined as the product of the term frequency and
the inverse document frequency:

$$\text{tf-idf}(t, d) = \text{tf}(t, d) \times \text{idf}(t, d)$$

Here tf(t, d) is the term frequency that we introduced in the previous section,
and the inverse document frequency idf(t, d) can be calculated as:

$$\text{idf}(t, d) = \log \frac{n_d}{1 + \text{df}(d, t)}$$

where $n_d$ is the total number of documents, and df(d, t) is the number of
documents d that contain the term t. Note that adding the constant 1 to the
denominator is optional and serves the purpose of assigning a non-zero value to
terms that occur in all training samples; the log is used to ensure that low
document frequencies are not given too much weight.


Scikit-learn implements yet another transformer, the TfidfTransformer , that 
takes the raw term frequencies from CountVectorizer as input and transforms 
them into tf-idfs: 


from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer(use_idf=True, norm='l2', smooth_idf=True)
print(tfidf.fit_transform(count.fit_transform(docs)).toarray())


[[ 0.    0.43  0.    0.56  0.56  0.    0.43  0.    0.  ]
 [ 0.    0.43  0.    0.    0.    0.56  0.43  0.    0.56]
 [ 0.5   0.45  0.5   0.19  0.19  0.19  0.3   0.25  0.19]]


As we saw in the previous subsection, the word "is" (column 2) had the largest 
term frequency in the 3rd document, being the most frequently occurring word. 
However, after transforming the same feature vector into tf-idfs, we see that the 
word "is" is now associated with a relatively small tf-idf (0.45) in document 3 since 
it is also contained in documents 1 and 2 and thus is unlikely to contain any 
useful, discriminatory information. 


However, if we'd manually calculated the tf-idfs of the individual terms in our
feature vectors, we'd have noticed that the TfidfTransformer calculates the
tf-idfs slightly differently compared to the standard textbook equations that we
saw earlier. The equation for the idf that is implemented in scikit-learn is:

$$\text{idf}(t, d) = \log \frac{1 + n_d}{1 + \text{df}(d, t)}$$

The tf-idf equation that is implemented in scikit-learn is as follows:

$$\text{tf-idf}(t, d) = \text{tf}(t, d) \times \big(\text{idf}(t, d) + 1\big)$$

While it is also more typical to normalize the raw term frequencies before
calculating the tf-idfs, the TfidfTransformer normalizes the tf-idfs directly.

By default (norm='l2'), scikit-learn's TfidfTransformer applies the
L2-normalization, which returns a vector of length 1 by dividing an un-normalized
feature vector v by its L2-norm:

$$v_{\text{norm}} = \frac{v}{\lVert v \rVert_2} = \frac{v}{\sqrt{v_1^2 + v_2^2 + \cdots + v_n^2}} = \frac{v}{\left(\sum_{i=1}^{n} v_i^2\right)^{1/2}}$$


To make sure that we understand how TfidfTransformer works, let us walk through
an example and calculate the tf-idf of the word "is" in the 3rd document.

The word "is" has a term frequency of 3 (tf = 3) in document 3, and the document
frequency of this term is 3 since the term "is" occurs in all three documents
(df = 3). Thus, we can calculate the idf as follows:

$$\text{idf}(\text{"is"}, d3) = \log \frac{1 + 3}{1 + 3} = 0$$

Now in order to calculate the tf-idf, we simply need to add 1 to the inverse
document frequency and multiply it by the term frequency:

$$\text{tf-idf}(\text{"is"}, d3) = 3 \times (0 + 1) = 3$$


tf_is = 3
n_docs = 3
idf_is = np.log((n_docs + 1) / (3 + 1))
tfidf_is = tf_is * (idf_is + 1)
print('tf-idf of term "is" = %.2f' % tfidf_is)


tf-idf of term "is" = 3.00


If we repeated these calculations for all terms in the 3rd document, we'd obtain
the following tf-idf vector: [3.39, 3.0, 3.39, 1.29, 1.29, 1.29, 2.0, 1.69, 1.29].
However, we notice that the values in this feature vector are different from the
values that we obtained from the TfidfTransformer that we used previously. The
step that we are missing in this tf-idf calculation is the L2-normalization, which
can be applied as follows:

$$\text{tf-idf}(d3)_{\text{norm}} = \frac{[3.39, 3.0, 3.39, 1.29, 1.29, 1.29, 2.0, 1.69, 1.29]}{\sqrt{3.39^2 + 3.0^2 + 3.39^2 + 1.29^2 + 1.29^2 + 1.29^2 + 2.0^2 + 1.69^2 + 1.29^2}}$$

$$= [0.5, 0.45, 0.5, 0.19, 0.19, 0.19, 0.3, 0.25, 0.19]$$

$$\Rightarrow \text{tf-idf}_{\text{norm}}(\text{"is"}, d3) = 0.45$$


tfidf = TfidfTransformer(use_idf=True, norm=None, smooth_idf=True)
raw_tfidf = tfidf.fit_transform(count.fit_transform(docs)).toarray()[-1]
raw_tfidf


array([ 3.39,  3.  ,  3.39,  1.29,  1.29,  1.29,  2.  ,  1.69,  1.29])


l2_tfidf = raw_tfidf / np.sqrt(np.sum(raw_tfidf**2))
l2_tfidf


array([ 0.5 ,  0.45,  0.5 ,  0.19,  0.19,  0.19,  0.3 ,  0.25,  0.19])


As we can see, the results match the results returned earlier by scikit-learn's
TfidfTransformer with its default L2-normalization. Since we now understand how
tf-idfs are calculated, let us proceed to the next sections and apply those
concepts to the movie review dataset.
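
As a cross-check of the walkthrough above, the whole calculation (smoothed idf, the "+ 1" term, and L2-normalization) can be written without scikit-learn. This is a sketch that follows the equations in this section, not the library's source code; the function name tfidf_l2 is hypothetical, and the count matrix is the bag-of-words output shown earlier.

```python
import math


def tfidf_l2(counts):
    """tf-idf with idf = log((1 + n_d) / (1 + df)) + 1, then L2-normalized rows."""
    n_docs = len(counts)
    n_terms = len(counts[0])
    # Document frequency: in how many documents does each term occur?
    df = [sum(1 for doc in counts if doc[j] > 0) for j in range(n_terms)]
    idf = [math.log((1.0 + n_docs) / (1.0 + df_j)) + 1.0 for df_j in df]
    rows = []
    for doc in counts:
        raw = [tf * idf_j for tf, idf_j in zip(doc, idf)]
        norm = math.sqrt(sum(v * v for v in raw))
        rows.append([v / norm for v in raw] if norm > 0 else raw)
    return rows


counts = [[0, 1, 0, 1, 1, 0, 1, 0, 0],
          [0, 1, 0, 0, 0, 1, 1, 0, 1],
          [2, 3, 2, 1, 1, 1, 2, 1, 1]]
doc3 = tfidf_l2(counts)[2]
print([round(v, 2) for v in doc3])
```

Rounded to two decimals, the third row agrees with the normalized vector derived by hand in this section.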


Cleaning text data 
df.loc[0, 'review'][-50:]


u'is seven.<br /><br />Title (Brazil): Not Available'


The text contains HTML markup tags that we need to clean. We will now remove
all punctuation marks, keeping only emoticon characters such as ":)".


import re

def preprocessor(text):
    text = re.sub(r'<[^>]*>', '', text)  # remove HTML markup
    # collect emoticons such as :) ;-( =D before stripping punctuation
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # replace non-word characters with spaces, then append the emoticons
    text = re.sub(r'[\W]+', ' ', text.lower()) + \
        ' '.join(emoticons).replace('-', '')
    return text


preprocessor(df.loc[0, 'review'][-50:])


u'is seven title brazil not available'


preprocessor("</a>This :) is :( a test :-)!")


'this is a test :) :( :)'


df['review'] = df['review'].apply(preprocessor)


Processing documents into tokens 
Tokenizing splits the text corpora into individual elements. The simplest way to
tokenize is to split the text into words at whitespace characters.


Word stemming transforms a word into its root form (for example, running -> run).
One such algorithm is the Porter stemmer algorithm.


The following uses nltk, which must be installed first: pip install nltk
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()

def tokenizer(text):
    return text.split()


def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]


tokenizer('runners like running and thus they run')


['runners', 'like', 'running', 'and', 'thus', 'they', 'run']


tokenizer_porter('runners like running and thus they run')


[u'runner', u'like', u'run', u'and', u'thu', u'they', u'run']


Stop words are words that carry little actual meaning. Most of them play an
auxiliary role, but they occur very frequently, so they need to be removed.
Examples are is, and, has.


import nltk
nltk.download('stopwords')  # download the stop-word corpus


[nltk_data] Downloading package stopwords to /Users/alan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True


from nltk.corpus import stopwords

stop = stopwords.words('english')

[w for w in tokenizer_porter('a runner likes running and runs a lot')[-10:]
 if w not in stop]


[u'runner', u'like', u'run', u'run', u'lot']


Training a logistic regression model for 
sentiment classification 
Strip HTML and punctuation to speed up the GridSearch later:


X_train = df.loc[:5000, 'review'].values
y_train = df.loc[:5000, 'sentiment'].values
X_test = df.loc[5000:, 'review'].values
y_test = df.loc[5000:, 'sentiment'].values


# In scikit-learn 0.18 and later, GridSearchCV lives in sklearn.model_selection
from sklearn.grid_search import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer


# TfidfVectorizer is equivalent to CountVectorizer + TfidfTransformer
tfidf = TfidfVectorizer(strip_accents=None,
                        lowercase=False,
                        preprocessor=None)


# grid search
param_grid = [{'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              {'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'vect__use_idf': [False],
               'vect__norm': [None],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]}]


# convert the text to a tf-idf matrix first, then fit the classifier
lr_tfidf = Pipeline([('vect', tfidf),
                     ('clf', LogisticRegression(random_state=0))])


gs_lr_tfidf = GridSearchCV(lr_tfidf, param_grid,
                           scoring='accuracy',
                           cv=5, verbose=1,
                           n_jobs=-1)


# with the data reduced to 5000 samples, this takes about 8 minutes
gs_lr_tfidf.fit(X_train, y_train)


Fitting 5 folds for each of 48 candidates, totalling 240 fits 


[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:  1.4min
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:  8.0min
[Parallel(n_jobs=-1)]: Done 240 out of 240 | elapsed: 10.5min finished


GridSearchCV(cv=5, error_score='raise',
       estimator=Pipeline(steps=[('vect', TfidfVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=False, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm=u'l2', preprocessor=None, smooth_idf=Tru...nalty='l2', random_state=0, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))]),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid=[{'vect__ngram_range': [(1, 1)], 'vect__tokenizer': [<function tokenizer at 0x130dca410>, <function tokenizer_porter at 0x130dca488>], 'clf__penalty': ['l1', 'l2'], 'clf__C': [1.0, 10.0, 100.0], 'vect__stop_words': [[u'i', u'me', u'my', u'myself', u'we', u'our', u'ours', u'ourselves', u'y...x130dca488>], 'vect__use_idf': [False], 'clf__C': [1.0, 10.0, 100.0], 'clf__penalty': ['l1', 'l2']}],
       pre_dispatch='2*n_jobs', refit=True, scoring='accuracy',
       verbose=1)


print('Best parameter set: %s ' % gs_lr_tfidf.best_params_)
print('CV Accuracy: %.3f' % gs_lr_tfidf.best_score_)


Best parameter set: {'vect__ngram_range': (1, 1), 'vect__tokenizer': <function
tokenizer at 0x130dca410>, 'clf__penalty': 'l2', 'clf__C': 10.0,
'vect__stop_words': [u'i', u'me', u'my', u'myself', u'we', u'our', u'ours',
u'ourselves', u'you', u'your', u'yours', u'yourself', u'yourselves', u'he',
u'him', u'his', u'himself', u'she', u'her', u'hers', u'herself', u'it', u'its',
u'itself', u'they', u'them', u'their', u'theirs', u'themselves', u'what',
u'which', u'who', u'whom', u'this', u'that', u'these', u'those', u'am', u'is',
u'are', u'was', u'were', u'be', u'been', u'being', u'have', u'has', u'had',
u'having', u'do', u'does', u'did', u'doing', u'a', u'an', u'the', u'and',
u'but', u'if', u'or', u'because', u'as', u'until', u'while', u'of', u'at',
u'by', u'for', u'with', u'about', u'against', u'between', u'into', u'through',
u'during', u'before', u'after', u'above', u'below', u'to', u'from', u'up',
u'down', u'in', u'out', u'on', u'off', u'over', u'under', u'again', u'further',
u'then', u'once', u'here', u'there', u'when', u'where', u'why', u'how', u'all',
u'any', u'both', u'each', u'few', u'more', u'most', u'other', u'some', u'such',
u'no', u'nor', u'not', u'only', u'own', u'same', u'so', u'than', u'too',
u'very', u's', u't', u'can', u'will', u'just', u'don', u'should', u'now', u'd',
u'll', u'm', u'o', u're', u've', u'y', u'ain', u'aren', u'couldn', u'didn',
u'doesn', u'hadn', u'hasn', u'haven', u'isn', u'ma', u'mightn', u'mustn',
u'needn', u'shan', u'shouldn', u'wasn', u'weren', u'won', u'wouldn']}

CV Accuracy: 0.862


clf = gs_lr_tfidf.best_estimator_
print('Test Accuracy: %.3f' % clf.score(X_test, y_test))


Test Accuracy: 0.873 


Working with bigger data - online
algorithms and out-of-core learning
Out-of-core learning is the task of training a machine learning model on a dataset
that does not fit into memory (RAM). This requires the following conditions:


• a feature extraction layer with fixed output dimensionality
• knowing the list of all classes in advance (in this case we only have positive
  and negative reviews)
• a machine learning algorithm that supports incremental learning (the
  partial_fit method in scikit-learn).
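
The first condition, a fixed output dimensionality without storing a vocabulary, is what the hashing trick provides. The sketch below illustrates the idea only: the function name hashing_vectorize and the use of zlib.crc32 are assumptions made for this example, whereas scikit-learn's HashingVectorizer uses a different hash function (MurmurHash3) and additional options.

```python
import zlib


def hashing_vectorize(tokens, n_features=16):
    """Map a token list to a fixed-length count vector via hashing."""
    vec = [0] * n_features
    for tok in tokens:
        # The hash value modulo n_features is the column index; no
        # vocabulary has to be kept in memory, at the cost of collisions.
        idx = zlib.crc32(tok.encode('utf-8')) % n_features
        vec[idx] += 1
    return vec


a = hashing_vectorize('the sun is shining'.split())
b = hashing_vectorize('a completely different and much longer document'.split())
print(len(a), len(b))
```

Both vectors have the same length no matter which words appear, so documents can be vectorized one mini-batch at a time.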


import numpy as np
import re
from nltk.corpus import stopwords


stop = stopwords.words('english')


def tokenizer(text):
    # strip HTML, keep emoticons, drop punctuation and stop words
    text = re.sub(r'<[^>]*>', '', text)
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    tokenized = [w for w in text.split() if w not in stop]
    return tokenized


def stream_docs(path):
    # read one document at a time: yields (review text, class label)
    with open(path, 'r') as csv:
        next(csv)  # skip header
        for line in csv:
            text, label = line[:-3], int(line[-2])
            yield text, label


gen = stream_docs(path='./data/movie_data.csv')
next(gen)


('"In 1974, the teenager Martha Moxley (Maggie Grace) moves to the high-class
area of Belle Haven, Greenwich, Connecticut. On the Mischief Night, eve of
Halloween, she was murdered in the backyard of her house and her murder
remained unsolved. Twenty-two years later, the writer Mark Fuhrman (Christopher
Meloni), who is a former LA detective that has fallen in disgrace for perjury
in O.J. Simpson trial and moved to Idaho, decides to investigate the case with
his partner Stephen Weeks (Andrew Mitchell) with the purpose of writing a book.
The locals squirm and do not welcome them, but with the support of the retired
detective Steve Carroll (Robert Forster) that was in charge of the
investigation in the 70\'s, they discover the criminal and a net of power and
money to cover the murder.<br /><br />""Murder in Greenwich"" is a good TV
movie, with the true story of a murder of a fifteen years old girl that was
committed by a wealthy teenager whose mother was a Kennedy. The powerful and
rich family used their influence to cover the murder for more than twenty
years. However, a snoopy detective and convicted perjurer in disgrace was able
to disclose how the hideous crime was committed. The screenplay shows the
investigation of Mark and the last days of Martha in parallel, but there is a
lack of the emotion in the dramatization. My vote is seven.<br /><br />Title
(Brazil): Not Available"',
 1)


def get_minibatch(doc_stream, size):
    """Take a document stream from the stream_docs function and
    return a particular number of documents."""
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        return None, None
    return docs, y


# HashingVectorizer makes use of the hashing trick
from sklearn.feature_extraction.text import HashingVectorizer
# SGDClassifier trains a logistic regression model using small minibatches of documents
from sklearn.linear_model import SGDClassifier


vect = HashingVectorizer(decode_error='ignore',
                         n_features=2**21,
                         preprocessor=None,
                         tokenizer=tokenizer)


clf = SGDClassifier(loss='log', random_state=1, n_iter=1)
doc_stream = stream_docs(path='./data/movie_data.csv')


# iterate over 45 minibatches of documents
# where each minibatch consists of 1,000 documents
import pyprind

pbar = pyprind.ProgBar(45)

classes = np.array([0, 1])

for _ in range(45):
    X_train, y_train = get_minibatch(doc_stream, size=1000)
    if not X_train:
        break
    X_train = vect.transform(X_train)
    clf.partial_fit(X_train, y_train, classes=classes)
    pbar.update()


0%                          100%
[##############################] | ETA: 00:00:00
Total time elapsed: 00:00:40


# use the last 5,000 documents to evaluate the performance
X_test, y_test = get_minibatch(doc_stream, size=5000)
X_test = vect.transform(X_test)
print('Accuracy: %.3f' % clf.score(X_test, y_test))


Accuracy: 0.867 


Although the accuracy is slightly lower than before, training is much faster and uses less memory.


# use the last 5,000 documents to update our model
# training can be continued with partial_fit
clf = clf.partial_fit(X_test, y_test)
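
To show what incremental learning means in isolation, here is a minimal pure-Python sketch of a logistic regression model that is updated one mini-batch at a time. It is not SGDClassifier's implementation; the class name OnlineLogReg, the learning rate eta, and the toy two-feature data are all assumptions made for the example.

```python
import math


class OnlineLogReg:
    """Logistic regression trained by per-sample gradient steps."""

    def __init__(self, n_features, eta=0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.eta = eta

    def _proba(self, x):
        z = self.b + sum(wi * xi for wi, xi in zip(self.w, x))
        return 1.0 / (1.0 + math.exp(-z))

    def partial_fit(self, X, y):
        # Update the existing weights with one pass over this mini-batch only.
        for x, target in zip(X, y):
            err = target - self._proba(x)
            self.w = [wi + self.eta * err * xi for wi, xi in zip(self.w, x)]
            self.b += self.eta * err
        return self

    def predict(self, X):
        return [1 if self._proba(x) >= 0.5 else 0 for x in X]


# Toy features: [count of positive words, count of negative words]
batch_X = [[1, 0], [0, 1], [2, 0], [0, 2]]
batch_y = [1, 0, 1, 0]

model = OnlineLogReg(n_features=2)
for _ in range(20):          # 20 mini-batches arriving one after another
    model.partial_fit(batch_X, batch_y)
print(model.predict([[3, 0], [0, 3]]))
```

Because the weights persist between calls, only the current mini-batch has to be held in memory at any time.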


Model persistence 


Training a model is expensive and time-consuming, and we do not want to retrain
the model every time the application runs. We therefore need to save the model
so that it can later make new predictions and be updated. The pickle module can
be used to store the model: it serializes a Python object into byte code that
can be written to disk and read back.

After we trained the logistic regression model as shown above, we can save the
classifier along with the stop words, Porter Stemmer, and HashingVectorizer
as serialized objects to our local disk so that we can use the fitted classifier in our
web application later.
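
The serialization round trip itself can be tried in memory before touching the disk. The objects below are hypothetical stand-ins for the stop-word list and the fitted classifier; only pickle.dumps and pickle.loads from the standard library are used.

```python
import pickle

# Hypothetical stand-ins for the real objects (stop words, model parameters)
stop_example = ['a', 'the', 'is']
weights_example = {'coef': [0.5, -1.2], 'intercept': 0.1}

# Serialize to a byte string; protocol 2 is also readable by Python 2
blob = pickle.dumps((stop_example, weights_example), protocol=2)

# Deserialize and confirm that the restored objects equal the originals
restored_stop, restored_weights = pickle.loads(blob)
print(restored_stop == stop_example, restored_weights == weights_example)
```

Note that unpickling executes code chosen by whoever created the file, so pickle files should only be loaded from trusted sources.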


import pickle 
import os 


dest = os.path.join('movieclassifier', 'pkl_objects')
if not os.path.exists(dest):
    os.makedirs(dest)


pickle.dump(stop, open(os.path.join(dest, 'stopwords.pkl'), 'wb'), protocol=2)
pickle.dump(clf, open(os.path.join(dest, 'classifier.pkl'), 'wb'), protocol=2)


Next, we save the HashingVectorizer in a separate file so that we can
import it later.


%%writefile movieclassifier/vectorizer.py
from sklearn.feature_extraction.text import HashingVectorizer
import re
import os
import pickle


cur_dir = os.path.dirname(__file__)
stop = pickle.load(open(
                os.path.join(cur_dir,
                'pkl_objects',
                'stopwords.pkl'), 'rb'))


def tokenizer(text):
    text = re.sub(r'<[^>]*>', '', text)
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)',
                           text.lower())
    text = re.sub(r'[\W]+', ' ', text.lower()) \
                   + ' '.join(emoticons).replace('-', '')
    tokenized = [w for w in text.split() if w not in stop]
    return tokenized


vect = HashingVectorizer(decode_error='ignore',
                         n_features=2**21,
                         preprocessor=None,
                         tokenizer=tokenizer)


Overwriting movieclassifier/vectorizer.py 


After executing the preceding code cells, we can now restart the IPython
notebook kernel to check if the objects were serialized correctly.


First, change the current Python directory to movieclassifier:


import os 
os.chdir('movieclassifier') 


import pickle
import re
import os
from vectorizer import vect


clf = pickle.load(open(os.path.join('pkl_objects', 'classifier.pkl'), 'rb'))


import numpy as np
label = {0: 'negative', 1: 'positive'}


example = ['I love this movie']
X = vect.transform(example)
print('Prediction: %s\nProbability: %.2f%%' %\
      (label[clf.predict(X)[0]], np.max(clf.predict_proba(X))*100))


Prediction: positive
Probability: 82.53%


word2vec 
word2vec is a method for training word vectors proposed by Mikolov et al. It has
two variants, the Continuous Bag-of-Words model (CBOW) and the Skip-Gram model:
the former uses the other words in a window of the word sequence to predict the
center word, while the latter uses the center word to predict the other words.
In practice, word2vec is usually trained with the Skip-Gram model combined with
Negative Sampling.


[Figure: the CBOW and Skip-gram architectures, each drawn with INPUT, PROJECTION, and OUTPUT layers]
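
The difference between the two variants is easiest to see in the training examples they produce. The following sketch only generates the (input, target) pairs from a token window; it does not train any vectors, and the function names and the window parameter are hypothetical.

```python
def context_windows(tokens, window=2):
    """Yield (center word, list of context words) for every position."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield center, left + right


def skipgram_pairs(tokens, window=2):
    # Skip-gram: the center word is the input, each context word a target.
    return [(c, w) for c, ctx in context_windows(tokens, window) for w in ctx]


def cbow_pairs(tokens, window=2):
    # CBOW: the whole context is the input, the center word is the target.
    return [(ctx, c) for c, ctx in context_windows(tokens, window)]


tokens = ['this', 'movie', 'is', 'good']
print(skipgram_pairs(tokens, window=1))
print(cbow_pairs(tokens, window=1))
```

Skip-gram yields one training pair per context word, while CBOW yields one pair per position.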


# the gensim library includes a word2vec module
from gensim.models.word2vec import Word2Vec


Train the word vectors. The input that gensim expects is an iterator over tokenized sentences.


class MySentence:
    def __init__(self, data):
        self.data = data

    def __iter__(self):
        for line in self.data:
            yield line.lower().split()


train_corpus = MySentence(df['review'])
# or: train_corpus = [s.split() for s in df['review']]
model = Word2Vec(train_corpus,
                 size=200,   # dimensionality of the word vectors
                 iter=20,    # number of passes over the data during training, i.e. epochs
                 workers=4)  # number of worker processes used


A quick look at the resulting word vectors


model.most_similar('good')


[(u'decent', 0.7367605566978455), 


(u'bad', 0.7155213356018066), 

(u'great', 0.7067341804504395), 
(u'nice', 0.6252745389938354), 
(u'cool', 0.5865251421928406), 
(u'passable', 0.5830472707748413), 
(u'funny', 0.5790721774101257), 
(u'fine', 0.5767843127250671), 
(u'lousy', 0.5747220516204834), 
(u'terrible', 0.5679782629013062)] 


Get the word vector for a word


model['good']


array([ ... ], dtype=float32)


Saving / loading the model


model.save('data/imdb.d2v')


model = Word2Vec.load('data/imdb.d2v')


Using the trained word vectors for sentiment analysis


def get_doc_vec(sentence, model):
    # average the vectors of all words in the sentence that the model knows
    scores = [model[word] for word in sentence.split()
              if word in model]
    return np.mean(scores, axis=0)


X_word2vec_train = np.array([get_doc_vec(sentence, model) for sentence in X_train])
X_word2vec_test = np.array([get_doc_vec(sentence, model) for sentence in X_test])
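
The averaging step can be illustrated without gensim. In the sketch below a plain dictionary stands in for the trained model, and the names mean_vector and toy_vectors are hypothetical. Unlike get_doc_vec, it returns a zero vector when no word of the document is in the vocabulary; as far as I can tell, np.mean over an empty list would yield NaN in that case.

```python
def mean_vector(tokens, vectors, dim):
    """Average the vectors of all known tokens; zeros if none is known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    # zip(*known) iterates over the vector components (columns)
    return [sum(col) / len(known) for col in zip(*known)]


# Toy 2-dimensional "word vectors" (made-up values for illustration)
toy_vectors = {'good': [1.0, 0.0], 'movie': [0.0, 1.0]}

print(mean_vector('good movie unseen'.split(), toy_vectors, dim=2))
print(mean_vector(['unseen'], toy_vectors, dim=2))
```

Every document is thus mapped to a vector of the same dimensionality as the word vectors, which can be fed directly to a classifier.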


from sklearn.grid_search import GridSearchCV
from sklearn.linear_model import LogisticRegression


lr = LogisticRegression(random_state=0)


param_grid = [{'penalty': ['l1', 'l2'],
               'C': [1.0, 10.0, 100.0]}]


gs = GridSearchCV(lr, param_grid,
                  scoring='accuracy',
                  cv=5, verbose=1,
                  n_jobs=-1)


gs.fit(X_word2vec_train, y_train);


Fitting 5 folds for each of 6 candidates, totalling 30 fits


[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed:   14.5s finished


gs.best_params_


{'C': 1.0, 'penalty': 'l2'}


gs.best_score_


0.85642871425714862


clf = gs.best_estimator_
print('Test Accuracy: %.3f' % clf.score(X_word2vec_test, y_test))


Test Accuracy: 0.869 


Chinese news classification


from os import path 
import os 
import re 
import pandas as pd 
import numpy as np 


rootdir = 'data/SogouC.reduced/Reduced' 

dirs = os.listdir(rootdir) 

dirs = [path.join(rootdir,f) for f in dirs if f.startswith('C')] 
dirs 


['data/SogouC.reduced/Reduced/C000008',
 'data/SogouC.reduced/Reduced/C000010',
 'data/SogouC.reduced/Reduced/C000013',
 'data/SogouC.reduced/Reduced/C000014',
 'data/SogouC.reduced/Reduced/C000016',
 'data/SogouC.reduced/Reduced/C000020',
 'data/SogouC.reduced/Reduced/C000022',
 'data/SogouC.reduced/Reduced/C000023',
 'data/SogouC.reduced/Reduced/C000024']


def load_txt(x):
    # read one GBK-encoded news file and return its text
    with open(x) as f:
        res = [t.decode('gbk', 'ignore') for t in f]
    return ''.join(res)


# map each category index to the list of its documents
text_t = {}
for i, d in enumerate(dirs):
    files = os.listdir(d)
    files = [path.join(d, x) for x in files if x.endswith('txt')
             and not x.startswith('.')]
    text_t[i] = [load_txt(f) for f in files]


print(text_t[0][0][:100]) 


本 报 记 者 陈 雪 频 实习 记者 唐 翔 发 自 上 海 


一 家 刚刚 成 立 两 年 的 网 络 支 付 公司 ， 它 的 目标 是 成 为 市 值 109 亿 美元 的 上 市 公司 。 
这 家 公司 叫做 快 钱 ， 说 这 句 话 的 是 快 钱 的 CE0 关 国光 。 他 之 前 曾 任 网 荔 的 高 级 副 


flen = [len(t) for t in text_t.values()]
labels = np.repeat(text_t.keys(), flen)


import itertools
merged = list(itertools.chain.from_iterable(text_t.values()))


df = pd.DataFrame({'label': labels, 'txt': merged})
df.head()


label txt 
0 0 本 报 记 者 陈 雪 频 实 习 记 者 唐 翔 发 自 上 海 \r\n 一 家 刚刚 成 立 两 年 
的 网 络 支付 公司 ， 它 的 目标 有 是... 
ilo 证 券 通 : 百联 股份 未 来 5 年 有 能 力 保持 高 速 增长 \r\n\r\n 深度 报告 权 
2 0 5 月 09 日 消息 快 评 \r\n\rin 深度 报告 权威 内 参 来 自 “ 证 券 通 "WWW.... 
3 0 5 月 09 日 消息 快 评 \r\n\r\in 深度 报告 权威 内 参 RETIRA www... 
4 0 5 月 09 日 消息 快 评 \r\n\rin 深度 报告 权威 内 参 来 自 “ 证 券 通 "WWW.... 


import jieba
jieba.enable_parallel(4)

def cutword_1(x):
    # segment a Chinese text into words and join them with spaces
    words = jieba.cut(x)
    return ' '.join(words)


Building prefix dict from the default dictionary ... 

Loading model from cache /var/folders/y8/z6ws1f2907vbb33mcp8c633 
00000gn/T/jieba.cache 

Loading model cost 1.119 seconds. 

Prefix dict has been built succesfully. 


df['seg_word'] = df.txt.map(cutword_1) 


df .head( ) 


label txt seg_word 


A A 本 报 记 者 REM 实习 iz 


0 0 海 \r\n 一 家 刚刚 成 立 两 年 的 网 络 者 È KR 上 海 An 
支付 公司 ， 它 的 目标 是 ... 一 家 刚刚 RI... 
证 券 通 : 百联 股份 未 来 5 年 有 能 力 保 证 券 通 : 百联 股份 未 来 

1 0 持 高 速 增长 \r\n\r\in 深度 报告 权威 内 5 年 有 能 力 保持 高 速 增 
参 ... En... 

2 0 5 月 09 日 消息 快 评 \r\n\r\in 深度 报告 5 月 09 日 消息 快 评 \r\n 
权威 内 参 来 自 “ 证 券 通 "WWW.... Win 深度 RE... 

3 0 5 月 09 日 消息 快 评 \r\n\r\in 深度 报告 5 H 09 日 消息 HE Wn 
权威 内 参 来 自 “ 证 券 通 "WWW.... Win 深度 dE... 

4 0 5 月 09 日 消息 快 评 \r\n\r\in 深度 报告 5 H 09 日 消息 快 评 \r\n 
权威 内 参 来 自 “ 证 券 通 "WWW.... Win 深度 RE... 


from cPickle import dump, load 
dump(df, open('data/tmdf.pickle', 'wb')) 


from sklearn.feature_extraction.text import TfidfVectorizer 

vect = TfidfVectorizer(ngram_range=(1, 1), min_df=2, max_features=10000)

xvec = vect.fit_transform(df.seg_word)

xvec.shape


(17910, 10000) 


y = df.label 


from sklearn.cross_validation import train_test_split 

from sklearn.linear_model import LogisticRegression 

from sklearn.naive_bayes import MultinomialNB

from sklearn.ensemble import RandomForestClassifier

train_X, test_X, train_y, test_y = train_test_split(xvec, y, train_size=0.7, random_state=1)

clf = MultinomialNB( ) 


clf.fit(train_X, train_y) 


MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True) 


from sklearn import metrics 
pre = clf.predict(test_X) 
print metrics.classification_report(test_y, pre) 


precision recall f1-score support 
0 0.90 0.88 0.89 577 
1 0.89 0.81 0.85 603 
2 0.88 0.82 0.85 619 
3 0.98 0.97 0.98 584 
4 0.86 0.88 0.87 570 
5 0.88 0.79 0.83 600 
6 0.77 0.90 0.83 600 
7 0.76 0.83 0.80 615 
8 0.92 0.93 0.93 605 


avg / total 0.87 0.87 0.87 5373 


txt = df.seg_word.values

txtlist = [] 

for sent in txt: 
temp = [w for w in sent.split()] 
txtlist.append(temp) 


num_features = 100 
min_word_count = 10 
num_workers = 4 
context = 5 

epoch = 20 

sample = 1e-5 


from gensim.models import word2vec 


model = word2vec.Word2Vec(txtlist, workers=num_workers,
                          sample=sample,
                          size=num_features,
                          min_count=min_word_count,
                          window=context,
                          iter=epoch)


model.syn0.shape 


(57675, 100) 


for w in model.most_similar(u'4ixM'): 
print w[0], w[1] 


网 络 0.787674069405 
门户 网 站 0.747487425804 
搜索 引擎 0.744884610176 
无 线 0.732329308987 
B2B 0.713720798492 

网 络 广告 0.712735056877 
腾讯 0.702631175518 
MSN 0.701346695423 

大 旗 网 0.69608104229 
Google 0.68867880106 


def sentvec_1(sent, m=num_features, model=model):
    # average the word vectors of all in-vocabulary words of a sentence
    res = np.zeros(m)
    words = sent.split()
    num = 0
    for w in words:
        if w in model.index2word:
            res += model[w]
            num += 1.0
    if num == 0:
        return np.zeros(m)
    else:
        return res / num


n = df.shape[0]

sent_matrix = np.zeros([n, num_features], float)

for i, sent in enumerate(df.seg_word.values):
    sent_matrix[i, :] = sentvec_1(sent)

sent_matrix.shape


(17910, 100) 


from sklearn.ensemble import GradientBoostingClassifier 

from sklearn.cross_validation import train_test_split 

train_X, test_X, train_y, test_y = train_test_split(sent_matrix, y,
                                                    train_size=0.7, random_state=1)

clf = GradientBoostingClassifier()


clf.fit(train_X, train_y)

from sklearn import metrics

pre = clf.predict(test_X)

print metrics.classification_report(test_y, pre)


precision recall f1-score support 
0 0.91 0.85 0.88 577 
1 0.84 0.82 0.83 603 
2 0.86 0.88 0.87 619 
3 0.98 0.97 0.98 584 
4 0.84 0.86 0.85 570 
5 0.84 0.80 0.82 600 
6 0.86 0.87 0.87 600 
7 0.77 0.83 0.80 615 
8 0.93 0.93 0.93 605 


avg / total 0.87 0.87 0.87 5373 
• Single-layer neural network recap
• Introducing the multi-layer neural network architecture
• Activating a neural network via forward propagation
• Training neural networks via backpropagation
• Obtaining the MNIST dataset
• Implementing a multi-layer perceptron
• Training an artificial neural network
• Debugging neural networks with gradient checking
• Other neural network architectures
  ◦ Convolutional Neural Networks
  ◦ Recurrent Neural Networks


Modeling complex functions with artificial 
neural networks 
Single-layer neural network recap

Introducing the multi-layer neural network
architecture

Multi-layer perceptron (MLP)


MLP PUA RS XAVIER e E ARANA, 下 图 结构 可 以 写成 





[Figure: a three-layer network with a 1st layer (input layer), 2nd layer (hidden layer), and 3rd layer (output layer)]


MLP learning procedure 
1. Starting at the input layer, we forward propagate the patterns of the training
data through the network to generate an output.

2. Based on the network's output, we calculate the error that we want to 
minimize using a cost function that we will describe later. 

3. We backpropagate the error, find its derivative with respect to each weight in 
the network, and update the model. 
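
The three steps above can be sketched for a tiny network with one hidden layer. This is a minimal illustration under stated assumptions, not the implementation used later in these notes: the sigmoid activation, the squared-error cost, the weight layout (bias first in every row), the learning rate eta, and all function names are choices made for this example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, W1, W2):
    """Step 1: propagate x through the hidden layer to a single output."""
    h = [sigmoid(row[0] + sum(w * xi for w, xi in zip(row[1:], x))) for row in W1]
    out = sigmoid(W2[0] + sum(w * hi for w, hi in zip(W2[1:], h)))
    return h, out


def cost(out, y):
    """Step 2: squared error between the output and the target."""
    return 0.5 * (out - y) ** 2


def backward(x, y, W1, W2):
    """Step 3a: derivative of the cost with respect to every weight."""
    h, out = forward(x, W1, W2)
    delta_out = (out - y) * out * (1.0 - out)
    grad_W2 = [delta_out] + [delta_out * hi for hi in h]
    grad_W1 = []
    for j, hj in enumerate(h):
        # Error of hidden unit j, propagated back through its output weight
        delta_h = delta_out * W2[j + 1] * hj * (1.0 - hj)
        grad_W1.append([delta_h] + [delta_h * xi for xi in x])
    return grad_W1, grad_W2


def train_step(x, y, W1, W2, eta=0.5):
    """Step 3b: gradient descent update of all weights."""
    g1, g2 = backward(x, y, W1, W2)
    W1 = [[w - eta * g for w, g in zip(row, grow)] for row, grow in zip(W1, g1)]
    W2 = [w - eta * g for w, g in zip(W2, g2)]
    return W1, W2


x, y = [1.0, 0.5], 1.0
W1 = [[0.1, 0.2, -0.1], [0.0, -0.3, 0.4]]   # 2 hidden units, bias first
W2 = [0.05, 0.3, -0.2]                      # 1 output unit, bias first
initial_cost = cost(forward(x, W1, W2)[1], y)
for _ in range(50):
    W1, W2 = train_step(x, y, W1, W2)
final_cost = cost(forward(x, W1, W2)[1], y)
print(initial_cost, final_cost)
```

Repeating the forward pass, cost evaluation, and weight update drives the cost down on this single training example.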


Activating a neural network via forward 
propagation 
Loss functions

Training neural networks via backpropagation

Optimization methods

Stochastic Gradient Descent (SGD)


$$\Delta\theta^{(t)} = -\alpha \sum_{p} \nabla E_p\big(\theta^{(t)}\big) + \mu\, \Delta\theta^{(t-1)}$$


e Learning rate a: how large a step to take 
Momentum u: how important previous update is in calculating current update 
Decay: exponential rate of change of the leaming rate as a function of the number of iteration 


at each iteration: A(k) -一 O (x —1)/ (1 + decay + k) 


e Smoothing between steps 
Infer 2nd order information about optimization problem, like curvature 


Adaptive optimization algorithms adapt to the landscape and vary the parameters accordingly, performing 
parameterized scheduling with no human involvement 
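As a rough sketch of how the learning rate, momentum, and decay interact, the update rule can be applied to a simple quadratic. The function name `sgd_momentum`, its defaults, and the toy objective are assumptions for illustration. The gradient here is exact rather than estimated from a minibatch, so the sketch shows only the update rule, not the stochastic part.

```python
import numpy as np


def sgd_momentum(grad_fn, theta0, alpha=0.1, mu=0.9, decay=0.0, n_iter=200):
    """Gradient descent with momentum and learning-rate decay (illustrative)."""
    theta = np.asarray(theta0, dtype=float).copy()
    delta_prev = np.zeros_like(theta)
    for k in range(n_iter):
        # Decay schedule: alpha(k) = alpha(k-1) / (1 + decay * k)
        alpha = alpha / (1.0 + decay * k)
        # Update: current gradient step plus a fraction of the previous update
        delta = -alpha * grad_fn(theta) + mu * delta_prev
        theta += delta
        delta_prev = delta
    return theta


# Toy objective (assumed): f(theta) = ||theta - target||^2, gradient 2 * (theta - target)
target = np.array([3.0, -2.0])
theta_hat = sgd_momentum(lambda t: 2.0 * (t - target), theta0=[0.0, 0.0])
print(theta_hat)
```

With `decay=0.0` the learning rate stays constant. A positive `decay` shrinks the step size quickly, which trades speed for stability.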


Classifying handwritten digits 


Obtaining the MNIST dataset 




The MNIST dataset is publicly available at http://yann.lecun.com/exdb/mnist/ and 
consists of the following four parts: 


• Training set images: train-images-idx3-ubyte.gz (9.9 MB, 47 MB unzipped, 60,000 samples)
• Training set labels: train-labels-idx1-ubyte.gz (29 KB, 60 KB unzipped, 60,000 labels)
• Test set images: t10k-images-idx3-ubyte.gz (1.6 MB, 7.8 MB unzipped, 10,000 samples)
• Test set labels: t10k-labels-idx1-ubyte.gz (5 KB, 10 KB unzipped, 10,000 labels)


The code in this section loads both the training files and the t10k test files,
so download all four. After downloading the files, I recommend unzipping them
with the Unix/Linux gzip tool from the terminal for efficiency, e.g., using the
command


gzip *ubyte.gz -d


in your local MNIST download directory, or with your favorite unzipping tool if
you are working on a machine running Microsoft Windows. The images are
stored in byte form, and using the following function, we will read them into
NumPy arrays that we will use to train our MLP.


import os
import struct
import numpy as np


def load_mnist(path, kind='train'):
    """Load MNIST data from `path`"""
    labels_path = os.path.join(path, '%s-labels-idx1-ubyte' % kind)
    images_path = os.path.join(path, '%s-images-idx3-ubyte' % kind)

    # Label file: 8-byte header (magic number, item count), then one byte per label
    with open(labels_path, 'rb') as lbpath:
        magic, n = struct.unpack('>II', lbpath.read(8))
        labels = np.fromfile(lbpath, dtype=np.uint8)

    # Image file: 16-byte header, then 28 x 28 = 784 bytes per image
    with open(images_path, 'rb') as imgpath:
        magic, num, rows, cols = struct.unpack(">IIII", imgpath.read(16))
        images = np.fromfile(imgpath, dtype=np.uint8).reshape(len(labels), 784)

    return images, labels


X_train, y_train = load_mnist('data/mnist', kind='train')
print('Rows: %d, columns: %d' % (X_train.shape[0], X_train.shape[1]))


Rows: 60000, columns: 784


X_test, y_test = load_mnist('data/mnist', kind='t10k')
print('Rows: %d, columns: %d' % (X_test.shape[0], X_test.shape[1]))


Rows: 10000, columns: 784
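To make the header parsing in the loader easier to follow, here is a small sketch that builds a synthetic image file in memory and reads it back the same way. The two 2 x 2 "images" and their pixel values are made up for illustration; only the big-endian header layout and the magic number 2051 for image files follow the MNIST file format.

```python
import struct
import numpy as np

# Synthetic IDX image data (assumed for illustration): 2 images of 2 x 2 pixels.
# Header: magic number, number of images, rows, columns (big-endian unsigned ints).
header = struct.pack('>IIII', 2051, 2, 2, 2)
pixels = np.arange(8, dtype=np.uint8)
raw = header + pixels.tobytes()

# Read the 16-byte header, then interpret the remaining bytes as pixels.
magic, num, rows, cols = struct.unpack('>IIII', raw[:16])
images = np.frombuffer(raw, dtype=np.uint8, offset=16).reshape(num, rows * cols)

print(magic, num, rows, cols)
print(images)
```

Each row of `images` is one flattened image, which is the same layout the loader produces with 784 columns for the real 28 x 28 digits.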


Visualize the first digit of each class: 


# import pandas as pd
# train = pd.read_csv('data/digit_recognizer/train.csv')
# train.shape


import matplotlib.pyplot as plt
%matplotlib inline


fig, ax = plt.subplots(nrows=2, ncols=5, sharex=True, sharey=True)
ax = ax.flatten()
for i in range(10):
    # first training example whose label equals i
    img = X_train[y_train == i][0].reshape(28, 28)
    ax[i].imshow(img, cmap='Greys', interpolation='nearest')

ax[0].set_xticks([])
ax[0].set_yticks([])
plt.tight_layout()
# plt.savefig('./figures/mnist_all.png', dpi=300)


[Figure: the first training example of each digit class, 0 through 9, in a 2 x 5 grid]


Visualize 25 different versions of "7": 





fig, ax = plt.subplots(nrows=5, ncols=5, sharex=True, sharey=True)
ax = ax.flatten()
for i in range(25):
    # i-th training example labeled 7
    img = X_train[y_train == 7][i].reshape(28, 28)
    ax[i].imshow(img, cmap='Greys', interpolation='nearest')

ax[0].set_xticks([])
ax[0].set_yticks([])
plt.tight_layout()


[Figure: 25 handwritten examples of the digit 7 in a 5 x 5 grid]


Uncomment the following lines to optionally save the data in CSV format. 
However, note that those CSV files will take up a substantial amount of storage 
space: 


• train_img.csv: 1.1 GB (gigabytes)
• train_labels.csv: 1.4 MB (megabytes)
• test_img.csv: 187.0 MB
• test_labels.csv: 144 KB (kilobytes)


# np.savetxt('train_img.csv', X_train, fmt='%i', delimiter=',')
# np.savetxt('train_labels.csv', y_train, fmt='%i', delimiter=',')
# X_train = np.genfromtxt('train_img.csv', dtype=int, delimiter=',')
# y_train = np.genfromtxt('train_labels.csv', dtype=int, delimiter=',')


# np.savetxt('test_img.csv', X_test, fmt='%i', delimiter=',')
# np.savetxt('test_labels.csv', y_test, fmt='%i', delimiter=',')
# X_test = np.genfromtxt('test_img.csv', dtype=int, delimiter=',')
# y_test = np.genfromtxt('test_labels.csv', dtype=int, delimiter=',')


Implementing a multi-layer perceptron 
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import numpy as np 
from scipy.special import expit
import sys


class NeuralNetMLP(object):
    """ Feedforward neural network / Multi-layer perceptron classifier.

    Parameters
    ------------
    n_output : int
        Number of output units, should be equal to the
        number of unique class labels.

    n_features : int
        Number of features (dimensions) in the target dataset.
        Should be equal to the number of columns in the X array.

    n_hidden : int (default: 30)
        Number of hidden units.

    l1 : float (default: 0.0)
        Lambda value for L1-regularization.
        No regularization if l1=0.0 (default)

    l2 : float (default: 0.0)
        Lambda value for L2-regularization.
        No regularization if l2=0.0 (default)

    epochs : int (default: 500)
        Number of passes over the training set.

    eta : float (default: 0.001)
        Learning rate.

    alpha : float (default: 0.0)
        Momentum constant. Factor multiplied with the
        gradient of the previous epoch t-1 to improve
        learning speed
        w(t) := w(t) - (grad(t) + alpha*grad(t-1))

    decrease_const : float (default: 0.0)
        Decrease constant. Shrinks the learning rate
        after each epoch via eta / (1 + epoch*decrease_const)

    shuffle : bool (default: True)
        Shuffles training data every epoch if True to prevent circles.

    minibatches : int (default: 1)
        Divides training data into k minibatches for efficiency.
        Normal gradient descent learning if k=1 (default).

    random_state : int (default: None)
        Set random state for shuffling and initializing the weights.

    Attributes
    -----------
    cost_ : list
        Cost recorded after each minibatch update.

    """
    def __init__(self, n_output, n_features, n_hidden=30,
                 l1=0.0, l2=0.0, epochs=500, eta=0.001,
                 alpha=0.0, decrease_const=0.0, shuffle=True,
                 minibatches=1, random_state=None):

        np.random.seed(random_state)
        self.n_output = n_output
        self.n_features = n_features
        self.n_hidden = n_hidden
        self.w1, self.w2 = self._initialize_weights()
        self.l1 = l1
        self.l2 = l2
        self.epochs = epochs
        self.eta = eta
        self.alpha = alpha
        self.decrease_const = decrease_const
        self.shuffle = shuffle
        self.minibatches = minibatches

    def _encode_labels(self, y, k):
        """Encode labels into one-hot representation

        Parameters
        ------------
        y : array, shape = [n_samples]
            Target values.

        Returns
        -----------
        onehot : array, shape = (n_labels, n_samples)

        """
        onehot = np.zeros((k, y.shape[0]))
        for idx, val in enumerate(y):
            onehot[val, idx] = 1.0
        return onehot

    def _initialize_weights(self):
        """Initialize weights with small random numbers."""
        w1 = np.random.uniform(-1.0, 1.0,
                               size=self.n_hidden*(self.n_features + 1))
        w1 = w1.reshape(self.n_hidden, self.n_features + 1)
        w2 = np.random.uniform(-1.0, 1.0,
                               size=self.n_output*(self.n_hidden + 1))
        w2 = w2.reshape(self.n_output, self.n_hidden + 1)
        return w1, w2

    def _sigmoid(self, z):
        """Compute logistic function (sigmoid)

        Uses scipy.special.expit to avoid overflow
        error for very small input values z.

        """
        # return 1.0 / (1.0 + np.exp(-z))
        return expit(z)

    def _sigmoid_gradient(self, z):
        """Compute gradient of the logistic function"""
        sg = self._sigmoid(z)
        # the derivative of the sigmoid is simple: sg * (1 - sg)
        return sg * (1 - sg)

    def _add_bias_unit(self, X, how='column'):
        """Add bias unit (column or row of 1s) to array at index 0"""
        if how == 'column':
            X_new = np.ones((X.shape[0], X.shape[1]+1))
            X_new[:, 1:] = X
        elif how == 'row':
            X_new = np.ones((X.shape[0]+1, X.shape[1]))
            X_new[1:, :] = X
        else:
            raise AttributeError('`how` must be `column` or `row`')
        return X_new

    def _feedforward(self, X, w1, w2):
        """Compute feedforward step

        Parameters
        -----------
        X : array, shape = [n_samples, n_features]
            Input layer with original features.

        w1 : array, shape = [n_hidden_units, n_features]
            Weight matrix for input layer -> hidden layer.

        w2 : array, shape = [n_output_units, n_hidden_units]
            Weight matrix for hidden layer -> output layer.

        Returns
        ----------
        a1 : array, shape = [n_samples, n_features+1]
            Input values with bias unit.

        z2 : array, shape = [n_hidden, n_samples]
            Net input of hidden layer.

        a2 : array, shape = [n_hidden+1, n_samples]
            Activation of hidden layer.

        z3 : array, shape = [n_output_units, n_samples]
            Net input of output layer.

        a3 : array, shape = [n_output_units, n_samples]
            Activation of output layer.

        """
        a1 = self._add_bias_unit(X, how='column')
        z2 = w1.dot(a1.T)
        a2 = self._sigmoid(z2)
        a2 = self._add_bias_unit(a2, how='row')
        z3 = w2.dot(a2)
        a3 = self._sigmoid(z3)
        return a1, z2, a2, z3, a3

    def _L2_reg(self, lambda_, w1, w2):
        """Compute L2-regularization cost"""
        return (lambda_/2.0) * (np.sum(w1[:, 1:] ** 2) +
                                np.sum(w2[:, 1:] ** 2))

    def _L1_reg(self, lambda_, w1, w2):
        """Compute L1-regularization cost"""
        return (lambda_/2.0) * (np.abs(w1[:, 1:]).sum() +
                                np.abs(w2[:, 1:]).sum())

    def _get_cost(self, y_enc, output, w1, w2):
        """Compute cost function.

        Parameters
        ----------
        y_enc : array, shape = (n_labels, n_samples)
            one-hot encoded class labels.

        output : array, shape = [n_output_units, n_samples]
            Activation of the output layer (feedforward)

        w1 : array, shape = [n_hidden_units, n_features]
            Weight matrix for input layer -> hidden layer.

        w2 : array, shape = [n_output_units, n_hidden_units]
            Weight matrix for hidden layer -> output layer.

        Returns
        ---------
        cost : float
            Regularized cost.

        """
        term1 = -y_enc * (np.log(output))
        term2 = (1 - y_enc) * np.log(1 - output)
        # logistic cost summed over all samples and output units
        cost = np.sum(term1 - term2)
        L1_term = self._L1_reg(self.l1, w1, w2)
        L2_term = self._L2_reg(self.l2, w1, w2)
        cost = cost + L1_term + L2_term
        return cost

    def _get_gradient(self, a1, a2, a3, z2, y_enc, w1, w2):
        """ Compute gradient step using backpropagation.

        Parameters
        ------------
        a1 : array, shape = [n_samples, n_features+1]
            Input values with bias unit.

        a2 : array, shape = [n_hidden+1, n_samples]
            Activation of hidden layer.

        a3 : array, shape = [n_output_units, n_samples]
            Activation of output layer.

        z2 : array, shape = [n_hidden, n_samples]
            Net input of hidden layer.

        y_enc : array, shape = (n_labels, n_samples)
            one-hot encoded class labels.

        w1 : array, shape = [n_hidden_units, n_features]
            Weight matrix for input layer -> hidden layer.

        w2 : array, shape = [n_output_units, n_hidden_units]
            Weight matrix for hidden layer -> output layer.

        Returns
        ---------
        grad1 : array, shape = [n_hidden_units, n_features]
            Gradient of the weight matrix w1.

        grad2 : array, shape = [n_output_units, n_hidden_units]
            Gradient of the weight matrix w2.

        """
        # backpropagation
        sigma3 = a3 - y_enc
        z2 = self._add_bias_unit(z2, how='row')
        sigma2 = w2.T.dot(sigma3) * self._sigmoid_gradient(z2)
        sigma2 = sigma2[1:, :]
        grad1 = sigma2.dot(a1)
        grad2 = sigma3.dot(a2.T)

        # regularize
        grad1[:, 1:] += (w1[:, 1:] * (self.l1 + self.l2))
        grad2[:, 1:] += (w2[:, 1:] * (self.l1 + self.l2))

        return grad1, grad2

    def predict(self, X):
        """Predict class labels

        Parameters
        -----------
        X : array, shape = [n_samples, n_features]
            Input layer with original features.

        Returns:
        ----------
        y_pred : array, shape = [n_samples]
            Predicted class labels.

        """
        if len(X.shape) != 2:
            raise AttributeError('X must be a [n_samples, n_features] array.\n'
                                 'Use X[:,None] for 1-feature classification,'
                                 '\nor X[[i]] for 1-sample classification')

        a1, z2, a2, z3, a3 = self._feedforward(X, self.w1, self.w2)
        y_pred = np.argmax(z3, axis=0)
        return y_pred

    def fit(self, X, y, print_progress=False):
        """ Learn weights from training data.

        Parameters
        -----------
        X : array, shape = [n_samples, n_features]
            Input layer with original features.

        y : array, shape = [n_samples]
            Target class labels.

        print_progress : bool (default: False)
            Prints progress as the number of epochs
            to stderr.

        Returns:
        ----------
        self

        """
        self.cost_ = []
        X_data, y_data = X.copy(), y.copy()
        y_enc = self._encode_labels(y, self.n_output)

        delta_w1_prev = np.zeros(self.w1.shape)
        delta_w2_prev = np.zeros(self.w2.shape)

        for i in range(self.epochs):

            # adaptive learning rate
            self.eta /= (1 + self.decrease_const*i)

            if print_progress:
                sys.stderr.write('\rEpoch: %d/%d' % (i+1, self.epochs))
                sys.stderr.flush()

            if self.shuffle:
                # permute samples and their one-hot labels together
                idx = np.random.permutation(y_data.shape[0])
                X_data, y_enc = X_data[idx], y_enc[:, idx]

            mini = np.array_split(range(y_data.shape[0]), self.minibatches)
            for idx in mini:

                # feedforward
                a1, z2, a2, z3, a3 = self._feedforward(X_data[idx],
                                                       self.w1,
                                                       self.w2)
                cost = self._get_cost(y_enc=y_enc[:, idx],
                                      output=a3,
                                      w1=self.w1,
                                      w2=self.w2)
                self.cost_.append(cost)

                # compute gradient via backpropagation
                grad1, grad2 = self._get_gradient(a1=a1, a2=a2,
                                                  a3=a3, z2=z2,
                                                  y_enc=y_enc[:, idx],
                                                  w1=self.w1,
                                                  w2=self.w2)

                # update weights; alpha * delta_w_prev is the momentum term
                delta_w1, delta_w2 = self.eta * grad1, self.eta * grad2
                self.w1 -= (delta_w1 + (self.alpha * delta_w1_prev))
                self.w2 -= (delta_w2 + (self.alpha * delta_w2_prev))
                delta_w1_prev, delta_w2_prev = delta_w1, delta_w2

        return self
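The label encoding used by the class stores one column per sample, with a single 1.0 in the row of that sample's class. A vectorized variant of the same idea might look like the sketch below; the function name `one_hot` and the example labels are assumptions for illustration, not part of the class above.

```python
import numpy as np


def one_hot(y, k):
    """Return a (k, n_samples) array with a 1.0 in row y[i] of column i."""
    onehot = np.zeros((k, y.shape[0]))
    # Fancy indexing sets one entry per column in a single step
    onehot[y, np.arange(y.shape[0])] = 1.0
    return onehot


y = np.array([2, 0, 1, 2])   # made-up class labels
enc = one_hot(y, 3)
print(enc)
```

Taking `np.argmax(enc, axis=0)` recovers the original labels, which is the same operation `predict` applies to the network output.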


Training an artificial neural network 


nn = NeuralNetMLP(n_output=10,
                  n_features=X_train.shape[1],
                  n_hidden=50,
                  l2=0.1,
                  l1=0.0,
                  epochs=1000,
                  eta=0.001,
                  alpha=0.001,
                  decrease_const=0.00001,
                  minibatches=50,
                  random_state=1)


nn.fit(X_train, y_train, print_progress=True)


Epoch: 1000/1000


<__main__.NeuralNetMLP at 0x11854e7d0>


# cost after every minibatch update (50 minibatches per epoch)
plt.plot(range(len(nn.cost_)), nn.cost_)
plt.ylim([0, 2000])
plt.ylabel('Cost')
plt.xlabel('Epochs * 50')
plt.tight_layout()


[Figure: cost per minibatch update; x-axis "Epochs * 50" (0 to 50000), y-axis "Cost"]


# average the minibatch costs within each of the 1000 epochs
batches = np.array_split(range(len(nn.cost_)), 1000)
cost_ary = np.array(nn.cost_)
cost_avgs = [np.mean(cost_ary[i]) for i in batches]


plt.plot(range(len(cost_avgs)), cost_avgs, color='red')
plt.ylim([0, 2000])
plt.ylabel('Cost')
plt.xlabel('Epochs')
plt.tight_layout();
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y_train_pred = nn.predict(X_train) 
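acc = np.sum(y_train == y_train_pred, axis=0) / float(X_train.shape[0])
print('Training accuracy: %.2f%%' % (acc * 100))


Training accuracy: 0.00%


y_test_pred = nn.predict(X_test)
acc = np.sum(y_test == y_test_pred, axis=0) / float(X_test.shape[0])
print('Test accuracy: %.2f%%' % (acc * 100))


Test accuracy: 0.00%


Note: the 0.00% values printed in the original run are most likely an artifact
of Python 2 integer division (an integer count divided by an integer sample
count truncates to 0) rather than the true accuracy. Dividing by a float, as
above, avoids the problem.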
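Because dividing two integers behaves differently in Python 2 and Python 3, it can be safer to compute accuracy as the mean of a boolean array, which is always a float. The sketch below uses made-up labels and a hypothetical helper named `accuracy`.

```python
import numpy as np


def accuracy(y_true, y_pred):
    """Fraction of matching labels, always computed in floating point."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))


y_true = np.array([7, 2, 1, 0, 4])
y_pred = np.array([7, 2, 1, 0, 9])   # made-up predictions: 4 of 5 correct
print('Accuracy: %.2f%%' % (accuracy(y_true, y_pred) * 100))   # Accuracy: 80.00%
```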


miscl_img = X_test[y_test != y_test_pred][:25]
correct_lab = y_test[y_test != y_test_pred][:25]
miscl_lab = y_test_pred[y_test != y_test_pred][:25]


fig, ax = plt.subplots(nrows=5, ncols=5, sharex=True, sharey=True)
ax = ax.flatten()
for i in range(25):
    img = miscl_img[i].reshape(28, 28)
    ax[i].imshow(img, cmap='Greys', interpolation='nearest')
    # title: running index, true label (t), predicted label (p)
    ax[i].set_title('%d) t: %d p: %d' % (i+1, correct_lab[i], miscl_lab[i]))

ax[0].set_xticks([])
ax[0].set_yticks([])
plt.tight_layout()
plt.show()





Debugging neural networks with gradient checking
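Gradient checking compares an analytically derived gradient with a numerical approximation obtained from central differences. Before looking at the full MLP version, the idea can be sketched on a much smaller model. Everything in this example is an assumption for illustration: a single-layer logistic cost stands in for the MLP cost, and the data is random.

```python
import numpy as np


def cost(w, X, y):
    """Logistic cost of a single-layer model (stand-in for the MLP cost)."""
    p = 1.0 / (1.0 + np.exp(-X.dot(w)))
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))


def analytic_grad(w, X, y):
    """Gradient of the logistic cost derived by hand."""
    p = 1.0 / (1.0 + np.exp(-X.dot(w)))
    return X.T.dot(p - y)


def numerical_grad(w, X, y, epsilon=1e-5):
    """Central-difference approximation, one weight at a time."""
    grad = np.zeros_like(w)
    for j in range(w.size):
        e = np.zeros_like(w)
        e[j] = epsilon
        grad[j] = (cost(w + e, X, y) - cost(w - e, X, y)) / (2.0 * epsilon)
    return grad


rng = np.random.RandomState(1)
X = rng.normal(size=(20, 3))
y = (rng.uniform(size=20) > 0.5).astype(float)
w = rng.normal(size=3)

num, ana = numerical_grad(w, X, y), analytic_grad(w, X, y)
# Relative error: norm of the difference over the sum of the two norms
rel_error = np.linalg.norm(num - ana) / (np.linalg.norm(num) + np.linalg.norm(ana))
print('relative error: %s' % rel_error)
```

A very small relative error suggests the analytic gradient is implemented correctly. The numerical version needs two cost evaluations per weight, so it is only practical for debugging on small networks and small samples.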
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import numpy as np 
from scipy.special import expit
import sys


class MLPGradientCheck(object):
    """ Feedforward neural network / Multi-layer perceptron classifier.

    Parameters
    ------------
    n_output : int
        Number of output units, should be equal to the
        number of unique class labels.
    n_features : int
        Number of features (dimensions) in the target dataset.
        Should be equal to the number of columns in the X array.
    n_hidden : int (default: 30)
        Number of hidden units.
    l1 : float (default: 0.0)
        Lambda value for L1-regularization.
        No regularization if l1=0.0 (default)
    l2 : float (default: 0.0)
        Lambda value for L2-regularization.
        No regularization if l2=0.0 (default)
    epochs : int (default: 500)
        Number of passes over the training set.
    eta : float (default: 0.001)
        Learning rate.
    alpha : float (default: 0.0)
        Momentum constant. Factor multiplied with the
        gradient of the previous epoch t-1 to improve
        learning speed
        w(t) := w(t) - (grad(t) + alpha*grad(t-1))
    decrease_const : float (default: 0.0)
        Decrease constant. Shrinks the learning rate
        after each epoch via eta / (1 + epoch*decrease_const)
    shuffle : bool (default: True)
        Shuffles training data every epoch if True to prevent circles.
    minibatches : int (default: 1)
        Divides training data into k minibatches for efficiency.
        Normal gradient descent learning if k=1 (default).
    random_state : int (default: None)
        Set random state for shuffling and initializing the weights.

    Attributes
    -----------
    cost_ : list
        Cost recorded after each minibatch update.

    """
    def __init__(self, n_output, n_features, n_hidden=30,
                 l1=0.0, l2=0.0, epochs=500, eta=0.001,
                 alpha=0.0, decrease_const=0.0, shuffle=True,
                 minibatches=1, random_state=None):

        np.random.seed(random_state)
        self.n_output = n_output
        self.n_features = n_features
        self.n_hidden = n_hidden
        self.w1, self.w2 = self._initialize_weights()
        self.l1 = l1
        self.l2 = l2
        self.epochs = epochs
        self.eta = eta
        self.alpha = alpha
        self.decrease_const = decrease_const
        self.shuffle = shuffle
        self.minibatches = minibatches

    def _encode_labels(self, y, k):
        """Encode labels into one-hot representation

        Parameters
        ------------
        y : array, shape = [n_samples]
            Target values.

        Returns
        -----------
        onehot : array, shape = (n_labels, n_samples)

        """
        onehot = np.zeros((k, y.shape[0]))
        for idx, val in enumerate(y):
            onehot[val, idx] = 1.0
        return onehot

    def _initialize_weights(self):
        """Initialize weights with small random numbers."""
        w1 = np.random.uniform(-1.0, 1.0,
                               size=self.n_hidden*(self.n_features + 1))
        w1 = w1.reshape(self.n_hidden, self.n_features + 1)
        w2 = np.random.uniform(-1.0, 1.0,
                               size=self.n_output*(self.n_hidden + 1))
        w2 = w2.reshape(self.n_output, self.n_hidden + 1)
        return w1, w2

    def _sigmoid(self, z):
        """Compute logistic function (sigmoid)

        Uses scipy.special.expit to avoid overflow
        error for very small input values z.

        """
        # return 1.0 / (1.0 + np.exp(-z))
        return expit(z)

    def _sigmoid_gradient(self, z):
        """Compute gradient of the logistic function"""
        sg = self._sigmoid(z)
        return sg * (1.0 - sg)

    def _add_bias_unit(self, X, how='column'):
        """Add bias unit (column or row of 1s) to array at index 0"""
        if how == 'column':
            X_new = np.ones((X.shape[0], X.shape[1] + 1))
            X_new[:, 1:] = X
        elif how == 'row':
            X_new = np.ones((X.shape[0]+1, X.shape[1]))
            X_new[1:, :] = X
        else:
            raise AttributeError('`how` must be `column` or `row`')
        return X_new

    def _feedforward(self, X, w1, w2):
        """Compute feedforward step

        Parameters
        -----------
        X : array, shape = [n_samples, n_features]
            Input layer with original features.
        w1 : array, shape = [n_hidden_units, n_features]
            Weight matrix for input layer -> hidden layer.
        w2 : array, shape = [n_output_units, n_hidden_units]
            Weight matrix for hidden layer -> output layer.

        Returns
        ----------
        a1 : array, shape = [n_samples, n_features+1]
            Input values with bias unit.
        z2 : array, shape = [n_hidden, n_samples]
            Net input of hidden layer.
        a2 : array, shape = [n_hidden+1, n_samples]
            Activation of hidden layer.
        z3 : array, shape = [n_output_units, n_samples]
            Net input of output layer.
        a3 : array, shape = [n_output_units, n_samples]
            Activation of output layer.

        """
        a1 = self._add_bias_unit(X, how='column')
        z2 = w1.dot(a1.T)
        a2 = self._sigmoid(z2)
        a2 = self._add_bias_unit(a2, how='row')
        z3 = w2.dot(a2)
        a3 = self._sigmoid(z3)
        return a1, z2, a2, z3, a3

    def _L2_reg(self, lambda_, w1, w2):
        """Compute L2-regularization cost"""
        return (lambda_/2.0) * (np.sum(w1[:, 1:] ** 2) +
                                np.sum(w2[:, 1:] ** 2))

    def _L1_reg(self, lambda_, w1, w2):
        """Compute L1-regularization cost"""
        return (lambda_/2.0) * (np.abs(w1[:, 1:]).sum() +
                                np.abs(w2[:, 1:]).sum())

    def _get_cost(self, y_enc, output, w1, w2):
        """Compute cost function.

        Parameters
        ----------
        y_enc : array, shape = (n_labels, n_samples)
            one-hot encoded class labels.
        output : array, shape = [n_output_units, n_samples]
            Activation of the output layer (feedforward)
        w1 : array, shape = [n_hidden_units, n_features]
            Weight matrix for input layer -> hidden layer.
        w2 : array, shape = [n_output_units, n_hidden_units]
            Weight matrix for hidden layer -> output layer.

        Returns
        ---------
        cost : float
            Regularized cost.

        """
        term1 = -y_enc * (np.log(output))
        term2 = (1.0 - y_enc) * np.log(1.0 - output)
        cost = np.sum(term1 - term2)
        L1_term = self._L1_reg(self.l1, w1, w2)
        L2_term = self._L2_reg(self.l2, w1, w2)
        cost = cost + L1_term + L2_term
        return cost

    def _get_gradient(self, a1, a2, a3, z2, y_enc, w1, w2):
        """ Compute gradient step using backpropagation.

        Parameters
        ------------
        a1 : array, shape = [n_samples, n_features+1]
            Input values with bias unit.
        a2 : array, shape = [n_hidden+1, n_samples]
            Activation of hidden layer.
        a3 : array, shape = [n_output_units, n_samples]
            Activation of output layer.
        z2 : array, shape = [n_hidden, n_samples]
            Net input of hidden layer.
        y_enc : array, shape = (n_labels, n_samples)
            one-hot encoded class labels.
        w1 : array, shape = [n_hidden_units, n_features]
            Weight matrix for input layer -> hidden layer.
        w2 : array, shape = [n_output_units, n_hidden_units]
            Weight matrix for hidden layer -> output layer.

        Returns
        ---------
        grad1 : array, shape = [n_hidden_units, n_features]
            Gradient of the weight matrix w1.
        grad2 : array, shape = [n_output_units, n_hidden_units]
            Gradient of the weight matrix w2.

        """
        # backpropagation
        sigma3 = a3 - y_enc
        z2 = self._add_bias_unit(z2, how='row')
        sigma2 = w2.T.dot(sigma3) * self._sigmoid_gradient(z2)
        sigma2 = sigma2[1:, :]
        grad1 = sigma2.dot(a1)
        grad2 = sigma3.dot(a2.T)

        # regularize
        grad1[:, 1:] += (w1[:, 1:] * (self.l1 + self.l2))
        grad2[:, 1:] += (w2[:, 1:] * (self.l1 + self.l2))

        return grad1, grad2

    def _gradient_checking(self, X, y_enc, w1, w2, epsilon, grad1, grad2):
        """ Apply gradient checking (for debugging only)

        Returns
        ---------
        relative_error : float
            Relative error between the numerically
            approximated gradients and the backpropagated gradients.

        """
        # numerical gradient of w1: perturb one weight at a time by +/- epsilon
        num_grad1 = np.zeros(np.shape(w1))
        epsilon_ary1 = np.zeros(np.shape(w1))
        for i in range(w1.shape[0]):
            for j in range(w1.shape[1]):
                epsilon_ary1[i, j] = epsilon
                a1, z2, a2, z3, a3 = self._feedforward(X,
                                                       w1 - epsilon_ary1, w2)
                cost1 = self._get_cost(y_enc, a3, w1 - epsilon_ary1, w2)
                a1, z2, a2, z3, a3 = self._feedforward(X,
                                                       w1 + epsilon_ary1, w2)
                cost2 = self._get_cost(y_enc, a3, w1 + epsilon_ary1, w2)
                num_grad1[i, j] = (cost2 - cost1) / (2.0 * epsilon)
                epsilon_ary1[i, j] = 0

        # numerical gradient of w2
        num_grad2 = np.zeros(np.shape(w2))
        epsilon_ary2 = np.zeros(np.shape(w2))
        for i in range(w2.shape[0]):
            for j in range(w2.shape[1]):
                epsilon_ary2[i, j] = epsilon
                a1, z2, a2, z3, a3 = self._feedforward(X, w1,
                                                       w2 - epsilon_ary2)
                cost1 = self._get_cost(y_enc, a3, w1, w2 - epsilon_ary2)
                a1, z2, a2, z3, a3 = self._feedforward(X, w1,
                                                       w2 + epsilon_ary2)
                cost2 = self._get_cost(y_enc, a3, w1, w2 + epsilon_ary2)
                num_grad2[i, j] = (cost2 - cost1) / (2.0 * epsilon)
                epsilon_ary2[i, j] = 0

        # compare numerical and backpropagated gradients
        num_grad = np.hstack((num_grad1.flatten(), num_grad2.flatten()))
        grad = np.hstack((grad1.flatten(), grad2.flatten()))
        norm1 = np.linalg.norm(num_grad - grad)
        norm2 = np.linalg.norm(num_grad)
        norm3 = np.linalg.norm(grad)
        relative_error = norm1 / (norm2 + norm3)
        return relative_error

    def predict(self, X):
        """Predict class labels

        Parameters
        -----------
        X : array, shape = [n_samples, n_features]
            Input layer with original features.

        Returns:
        ----------
        y_pred : array, shape = [n_samples]
            Predicted class labels.

        """
        if len(X.shape) != 2:
            raise AttributeError('X must be a [n_samples, n_features] array.\n'
                                 'Use X[:,None] for 1-feature classification,'
                                 '\nor X[[i]] for 1-sample classification')

        a1, z2, a2, z3, a3 = self._feedforward(X, self.w1, self.w2)
        y_pred = np.argmax(z3, axis=0)
        return y_pred

    def fit(self, X, y, print_progress=False):
        """ Learn weights from training data.

        Parameters
        -----------
        X : array, shape = [n_samples, n_features]
            Input layer with original features.
        y : array, shape = [n_samples]
            Target class labels.
        print_progress : bool (default: False)
            Prints progress as the number of epochs
            to stderr.

        Returns:
        ----------
        self

        """
        self.cost_ = []
        X_data, y_data = X.copy(), y.copy()
        y_enc = self._encode_labels(y, self.n_output)

        delta_w1_prev = np.zeros(self.w1.shape)
        delta_w2_prev = np.zeros(self.w2.shape)

        for i in range(self.epochs):

            # adaptive learning rate
            self.eta /= (1 + self.decrease_const*i)

            if print_progress:
                sys.stderr.write('\rEpoch: %d/%d' % (i+1, self.epochs))
                sys.stderr.flush()

            if self.shuffle:
                # permute samples and their one-hot labels together
                idx = np.random.permutation(y_data.shape[0])
                X_data, y_enc = X_data[idx], y_enc[:, idx]

            mini = np.array_split(range(y_data.shape[0]), self.minibatches)
            for idx in mini:

                # feedforward
                a1, z2, a2, z3, a3 = self._feedforward(X_data[idx],
                                                       self.w1,
                                                       self.w2)
                cost = self._get_cost(y_enc=y_enc[:, idx],
                                      output=a3,
                                      w1=self.w1,
                                      w2=self.w2)
                self.cost_.append(cost)

                # compute gradient via backpropagation
                grad1, grad2 = self._get_gradient(a1=a1, a2=a2,
                                                  a3=a3, z2=z2,
                                                  y_enc=y_enc[:, idx],
                                                  w1=self.w1,
                                                  w2=self.w2)

                # gradient checking
                grad_diff = self._gradient_checking(X=X_data[idx],
                                                    y_enc=y_enc[:, idx],
                                                    w1=self.w1,
                                                    w2=self.w2,
                                                    epsilon=1e-5,
                                                    grad1=grad1,
                                                    grad2=grad2)

                if grad_diff <= 1e-7:
                    print('Ok: %s' % grad_diff)
                elif grad_diff <= 1e-4:
                    print('Warning: %s' % grad_diff)
                else:
                    print('PROBLEM: %s' % grad_diff)

                # update weights; [alpha * delta_w_prev] for momentum learning
                delta_w1, delta_w2 = self.eta * grad1, self.eta * grad2
                self.w1 -= (delta_w1 + (self.alpha * delta_w1_prev))
                self.w2 -= (delta_w2 + (self.alpha * delta_w2_prev))
                delta_w1_prev, delta_w2_prev = delta_w1, delta_w2

        return self


nn_check = MLPGradientCheck(n_output=10,
                            n_features=X_train.shape[1],
                            n_hidden=10,
                            l2=0.0,
                            l1=0.0,
                            epochs=10,
                            eta=0.001,
                            alpha=0.0,
                            decrease_const=0.0,
                            minibatches=1,
                            shuffle=False,
                            random_state=1)


# gradient checking is expensive, so use only the first 5 training samples
nn_check.fit(X_train[:5], y_train[:5], print_progress=False)


[Output: ten lines of the form "Ok: <relative error>", one per epoch, each value on the order of 1e-10]


<__main__.MLPGradientCheck at 0x1288634d0>


Other neural network architectures 
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Convolutional Neural Networks 
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Recurrent Neural Networks 
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What is Deep Learning? 




Deep learning is a particular kind of machine learning that achieves great 
power and flexibility by learning to represent the world as a nested hierarchy 
of concepts, with each concept defined in relation to simpler concepts, and 
more abstract representations computed in terms of less abstract ones. 


I. Goodfellow, Y. Bengio, and A. Courville, "Deep Learning." Book in preparation
for MIT Press, 2016.
http://www.deeplearningbook.org/


Representation Learning 


Use machine learning to discover not only the mapping from representation to 
output but also the representation itself. 


I. Goodfellow, Y. Bengio, and A. Courville, "Deep Learning." Book in preparation
for MIT Press, 2016.
http://www.deeplearningbook.org/


What is TensorFlow 
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Tensor Flow 







TensorFlow™ is an open source software library for numerical computation
using data flow graphs. Nodes in the graph represent mathematical
operations, while the graph edges represent the multidimensional data arrays
(tensors) communicated between them. The flexible architecture allows you to
deploy computation to one or more CPUs or GPUs in a desktop, server, or
mobile device with a single API. TensorFlow was originally developed by
researchers and engineers working on the Google Brain Team within
Google's Machine Intelligence research organization for the purposes of
conducting machine learning and deep neural networks research, but the
system is general enough to be applicable in a wide variety of other domains
as well.


• A TensorFlow computation is described by a directed graph, which is composed of a set of nodes
• Each node represents the instantiation of an operation
• An operation represents an abstract computation (e.g., "matrix multiply" or "add")
• Client programs interact with the TensorFlow system by creating a Session
• Computations are represented as a dataflow graph where tensors flow along the graph edges


What is a Data Flow Graph 




Data flow graphs describe mathematical computation with a directed graph of
nodes and edges. Nodes typically implement mathematical operations, but can
also represent endpoints to feed in data, push out results, or read/write
persistent variables. Edges describe the input/output relationships between
nodes. These data edges carry dynamically-sized multidimensional data
arrays, or tensors. The flow of tensors through the graph is where TensorFlow
gets its name. Nodes are assigned to computational devices and execute
asynchronously and in parallel once all the tensors on their incoming edges
become available.
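The idea of a data flow graph can be illustrated without TensorFlow. The toy evaluator below is only a sketch in plain Python: the `Node` class and the `constant` and `run` helpers are hypothetical names invented for this example and are not part of the TensorFlow API. It evaluates nodes sequentially rather than in parallel, but it shows the key property that a node runs only after the values on its incoming edges are available.

```python
class Node(object):
    """A graph node: an operation plus the nodes whose outputs it consumes."""

    def __init__(self, op, inputs=()):
        self.op = op
        self.inputs = list(inputs)


def constant(value):
    """A node with no incoming edges that simply produces a value."""
    return Node(lambda: value)


def run(node, cache=None):
    """Evaluate a node after evaluating everything on its incoming edges."""
    cache = {} if cache is None else cache
    if node not in cache:
        args = [run(n, cache) for n in node.inputs]
        cache[node] = node.op(*args)
    return cache[node]


# Build the graph first; nothing is computed until run() is called.
a, b = constant(2), constant(3)
add = Node(lambda x, y: x + y, [a, b])
mul = Node(lambda x, y: x * y, [a, add])   # 2 * (2 + 3)

result = run(mul)
print(result)
```

Separating graph construction from execution is the same split that appears later between defining ops and calling `run()` in a session.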
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TensorFlow examples 


TENSORFLOW IN ONE SLIDE 


# Template only: input_dim, output_dim, beta_shape, some_cost_function,
# SomeOptimizer, trX, trY and num_epochs are placeholders to be filled in.

# standard imports
import numpy as np
import tensorflow as tf

# placeholders for our data
X = tf.placeholder("float", [None, input_dim])
Y = tf.placeholder("float", [None, output_dim])

# parameters to learn
beta = tf.Variable(tf.random_normal(beta_shape, stddev=0.01))

# some parametric model
def model(X, beta):
    # some function of X and beta
    pass

# applied to the symbolic variables
p_yx = model(X, beta)

# train by minimizing some cost function
cost = some_cost_function(p_yx, Y)
train_op = tf.train.SomeOptimizer.minimize(cost)

# create session and initialize variables
with tf.Session() as sess:
    sess.run(tf.initialize_all_variables())
    # train using data
    for _ in range(num_epochs):
        sess.run(train_op, feed_dict={X: trX, Y: trY})




Introduction 




Hello world 




• Session: TensorFlow runs the computation graph inside a session
• Fetches: calling run() in a session fetches the result of an operation


# Simple hello world using TensorFlow
import tensorflow as tf


# Create a Constant op
# The op is added as a node to the default graph.
#
# The value returned by the constructor represents the output
# of the Constant op.
hello = tf.constant('Hello, TensorFlow!')


# Start tf session
sess = tf.Session()


# Run graph
print(sess.run(hello))


Hello, TensorFlow! 


Basic Operations 


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import tensorflow as tf 


# Basic constant operations
# The value returned by the constructor represents the output
# of the Constant op.
a = tf.constant(2)
b = tf.constant(3)


# Launch the default graph.
with tf.Session() as sess:
    print("a=2, b=3")
    print("Addition with constants: %i" % sess.run(a+b))
    print("Multiplication with constants: %i" % sess.run(a*b))


a=2, b=3
Addition with constants: 5
Multiplication with constants: 6


Feed: passing a feed_dict argument to sess.run() feeds data into the corresponding nodes.
A placeholder is the endpoint through which data enters the graph, and it must be given a value in run().


# Basic Operations with variable as graph input
# The value returned by the constructor represents the output
# of the Variable op. (define as input when running session)
# tf Graph input
a = tf.placeholder(tf.int16)
b = tf.placeholder(tf.int16)
# A placeholder is the endpoint where information enters the graph;
# its value is supplied through feed_dict in the session.


# Define some operations
add = tf.add(a, b)
mul = tf.mul(a, b)  # tf.mul was renamed tf.multiply in TensorFlow 1.0
# The operations defined here are symbolic; they are only
# computed when sess.run is called.


# Launch the default graph.
with tf.Session() as sess:
    # Run every operation with variable input
    print("Addition with variables: %i" % sess.run(add, feed_dict={a: 2, b: 3}))
    print("Multiplication with variables: %i" % sess.run(mul, feed_dict={a: 2, b: 3}))


Addition with variables: 5
Multiplication with variables: 6


# More in details: 
# Matrix Multiplication from TensorFlow official tutorial 


# Create a Constant op that produces a 1x2 matrix. The op is
# added as a node to the default graph.
#
# The value returned by the constructor represents the output
# of the Constant op.
matrix1 = tf.constant([[3., 3.]])

# Create another Constant that produces a 2x1 matrix.
matrix2 = tf.constant([[2.],[2.]])

# Create a Matmul op that takes 'matrix1' and 'matrix2' as inputs.
# The returned value, 'product', represents the result of the matrix
# multiplication.
product = tf.matmul(matrix1, matrix2)

# To run the matmul op we call the session 'run()' method, passing 'product'
# which represents the output of the matmul op. This indicates to the call
# that we want to get the output of the matmul op back.
#
# All inputs needed by the op are run automatically by the session. They
# typically are run in parallel.
#
# The call 'run(product)' thus causes the execution of three ops in the
# graph: the two constants and matmul.
#
# The output of the op is returned in 'result' as a numpy `ndarray` object.
with tf.Session() as sess:
    result = sess.run(product)
    print result


[[ 12.]] 
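The printed value can be checked by hand: a 1x2 matrix times a 2x1 matrix gives a 1x1 matrix, here 3*2 + 3*2 = 12. The following is a minimal NumPy sketch of the same arithmetic, computed eagerly and outside any TensorFlow graph; the variable names only mirror the ones used above.

```python
import numpy as np

matrix1 = np.array([[3., 3.]])      # shape (1, 2)
matrix2 = np.array([[2.], [2.]])    # shape (2, 1)

# Row-by-column product: (1, 2) x (2, 1) -> (1, 1)
product = np.dot(matrix1, matrix2)
print(product.shape, product)
```

Unlike tf.matmul, np.dot returns the number immediately; in the TensorFlow version nothing is computed until sess.run(product) is called.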


Basic Models 


Linear Regression 
Variable:

• A variable is a trainable quantity in the computation graph.
• It must be initialized in the session before it is used.
• The name argument sets the variable's name in the graph.


import tensorflow as tf 

import numpy as np 

import matplotlib.pyplot as plt 
%matplotlib inline 


# Parameters 
learning_rate = 0.01 
training_epochs = 1000 
display_step = 100 


# Training Data
train_X = np.asarray([3.3,4.4,5.5,6.71,6.93,4.168,9.779,6.182,7.59,2.167,
                      7.042,10.791,5.313,7.997,5.654,9.27,3.1])
train_Y = np.asarray([1.7,2.76,2.09,3.19,1.694,1.573,3.366,2.596,2.53,1.221,
                      2.827,3.465,1.65,2.904,2.42,2.94,1.3])
n_samples = train_X.shape[0]

# tf Graph Input
X = tf.placeholder("float")
Y = tf.placeholder("float")

# Set model weights
W = tf.Variable(np.random.randn(), name="weight")
b = tf.Variable(np.random.randn(), name="bias")

# Construct a linear model
pred = tf.add(tf.mul(X, W), b)

# Mean squared error
cost = tf.reduce_sum(tf.pow(pred-Y, 2))/(2*n_samples)
# Gradient descent
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)

# Initializing the variables
init = tf.initialize_all_variables()  # define the initialization op

# Launch the graph
with tf.Session() as sess:
    sess.run(init)  # variables must be initialized in the session first

    # Fit all training data
    for epoch in range(training_epochs):
        for (x, y) in zip(train_X, train_Y):
            # each run of optimizer performs one gradient descent step
            sess.run(optimizer, feed_dict={X: x, Y: y})

        # Display logs per epoch step
        if (epoch+1) % display_step == 0:
            c = sess.run(cost, feed_dict={X: train_X, Y: train_Y})
            print "Epoch:", '%04d' % (epoch+1), "cost=", "{:.9f}".format(c), \
                "W=", sess.run(W), "b=", sess.run(b)

    print "Optimization Finished!"
    training_cost = sess.run(cost, feed_dict={X: train_X, Y: train_Y})
    print "Training cost=", training_cost, "W=", sess.run(W), "b=", sess.run(b), '\n'

    # Graphic display
    plt.plot(train_X, train_Y, 'ro', label='Original data')
    plt.plot(train_X, sess.run(W) * train_X + sess.run(b), label='Fitted line')
    plt.legend()
    plt.show()








Epoch: 0050 cost= 0.104400784 W= 0.157368 b= 1.46493
Epoch: 0100 cost= 0.101245113 W= 0.162853 b= 1.42547
Epoch: 0150 cost= 0.098453194 W= 0.168012 b= 1.38836
Epoch: 0200 cost= 0.095982887 W= 0.172864 b= 1.35345
Epoch: 0250 cost= 0.093797326 W= 0.177427 b= 1.32062
Epoch: 0300 cost= 0.091863610 W= 0.18172 b= 1.28975
Epoch: 0350 cost= 0.090152740 W= 0.185757 b= 1.26071
Epoch: 0400 cost= 0.088638850 W= 0.189554 b= 1.23339
Epoch: 0450 cost= 0.087299488 W= 0.193125 b= 1.2077
Epoch: 0500 cost= 0.086114518 W= 0.196483 b= 1.18354
Epoch: 0550 cost= 0.085066028 W= 0.199641 b= 1.16082
Epoch: 0600 cost= 0.084138311 W= 0.202612 b= 1.13945
Epoch: 0650 cost= 0.083317406 W= 0.205406 b= 1.11935
Epoch: 0700 cost= 0.082590967 W= 0.208034 b= 1.10044
Epoch: 0750 cost= 0.081948124 W= 0.210505 b= 1.08266
Epoch: 0800 cost= 0.081379279 W= 0.21283 b= 1.06594
Epoch: 0850 cost= 0.080875866 W= 0.215017 b= 1.05021
Epoch: 0900 cost= 0.080430366 W= 0.217073 b= 1.03542
Epoch: 0950 cost= 0.080036096 W= 0.219007 b= 1.0215
Epoch: 1000 cost= 0.079687178 W= 0.220826 b= 1.00842
Optimization Finished!
Training cost= 0.0796872 W= 0.220826 b= 1.00842

[Figure: scatter plot of the original data with the fitted line]
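To make explicit what each sess.run(optimizer, ...) call does in the linear regression example, here is a minimal pure-Python sketch of the same per-sample gradient descent update. It assumes the cost fed with a single sample is (W*x + b - y)^2 / (2 * n_samples), as in the graph above, and it starts from W = b = 0 instead of a random draw, so its numbers will not match the log exactly.

```python
train_X = [3.3, 4.4, 5.5, 6.71, 6.93, 4.168, 9.779, 6.182, 7.59, 2.167,
           7.042, 10.791, 5.313, 7.997, 5.654, 9.27, 3.1]
train_Y = [1.7, 2.76, 2.09, 3.19, 1.694, 1.573, 3.366, 2.596, 2.53, 1.221,
           2.827, 3.465, 1.65, 2.904, 2.42, 2.94, 1.3]
n_samples = len(train_X)
learning_rate = 0.01
training_epochs = 1000

def cost(w, b):
    # Same formula as the graph: sum of squared errors / (2 * n_samples)
    return sum((w * x + b - y) ** 2
               for x, y in zip(train_X, train_Y)) / (2.0 * n_samples)

w, b = 0.0, 0.0  # assumption: zero initialization instead of np.random.randn()
initial_cost = cost(w, b)

for epoch in range(training_epochs):
    for x, y in zip(train_X, train_Y):
        err = w * x + b - y
        # Gradient of err**2 / (2 * n_samples) with respect to w and b
        w -= learning_rate * err * x / n_samples
        b -= learning_rate * err / n_samples

final_cost = cost(w, b)
print("initial cost:", initial_cost)
print("final cost:", final_cost, "W=", w, "b=", b)
```

TensorFlow derives these two gradient expressions automatically from the graph; the sketch writes them out by hand only to show the update rule.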





Logistic Regression 
import tensorflow as tf

# Import MNIST data
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("/tmp/data/", one_hot=True)

Extracting /tmp/data/train-images-idx3-ubyte.gz
Extracting /tmp/data/train-labels-idx1-ubyte.gz
Extracting /tmp/data/t10k-images-idx3-ubyte.gz
Extracting /tmp/data/t10k-labels-idx1-ubyte.gz


# Parameters 
learning_rate = 0.01 
training_epochs = 25 
batch_size = 100 
display_step = 5 


# tf Graph Input
x = tf.placeholder(tf.float32, [None, 784])  # mnist data image of shape 28*28=784
y = tf.placeholder(tf.float32, [None, 10])  # 0-9 digits recognition => 10 classes

# Set model weights
W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))

# Construct model
pred = tf.nn.softmax(tf.matmul(x, W) + b)  # Softmax

# Minimize error using cross entropy
# (tf.reduce_mean is similar to np.mean())
cost = tf.reduce_mean(-tf.reduce_sum(y*tf.log(pred), reduction_indices=1))
# Gradient Descent
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)

# Initializing the variables
init = tf.initialize_all_variables()

# Launch the graph
with tf.Session() as sess:
    sess.run(init)

    # Training cycle
    for epoch in range(training_epochs):
        avg_cost = 0.
        total_batch = int(mnist.train.num_examples/batch_size)
        # Loop over all batches
        for i in range(total_batch):
            batch_xs, batch_ys = mnist.train.next_batch(batch_size)
            # Fit training using batch data
            _, c = sess.run([optimizer, cost], feed_dict={x: batch_xs,
                                                          y: batch_ys})
            # Compute average loss
            avg_cost += c / total_batch
        # Display logs per epoch step
        if (epoch+1) % display_step == 0:
            print "Epoch:", '%04d' % (epoch+1), "cost=", "{:.9f}".format(avg_cost)

    print "Optimization Finished!"

    # Test model
    correct_prediction = tf.equal(tf.argmax(pred, 1), tf.argmax(y, 1))
    # Calculate accuracy for 3000 examples
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
    print "Accuracy:", accuracy.eval({x: mnist.test.images[:3000], y: mnist.test.labels[:3000]})








Epoch: 0005 cost= 0.465507779 
Epoch: 0010 cost= 0.392393045 
Epoch: 0015 cost= 0.362739271 
Epoch: 0020 cost= 0.345433382 


Epoch: 0025 cost= 0.333723887 
Optimization Finished! 
Accuracy: 0.888333 
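The cost and accuracy used in the logistic regression example can be reproduced on a toy batch. The sketch below is a NumPy illustration, not the TensorFlow implementation: the helper names softmax, cross_entropy and accuracy and the scores in logits are made up for this example. It computes the mean over examples of -sum(y * log(pred)) and the fraction of rows whose argmax matches the one-hot label.

```python
import numpy as np

def softmax(logits):
    # Subtracting the row maximum avoids overflow and leaves the result unchanged.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def cross_entropy(pred, y_onehot):
    # Mean over examples of -sum_k y_k * log(pred_k)
    return np.mean(-np.sum(y_onehot * np.log(pred), axis=1))

def accuracy(pred, y_onehot):
    # Fraction of examples whose most probable class equals the label
    return np.mean(np.argmax(pred, axis=1) == np.argmax(y_onehot, axis=1))

logits = np.array([[2.0, 1.0, 0.1],
                   [0.5, 2.5, 0.3],
                   [1.2, 0.2, 0.4]])   # made-up scores: 3 examples, 3 classes
labels = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [0, 0, 1]])         # one-hot targets

pred = softmax(logits)
print("cost:", cross_entropy(pred, labels))
print("accuracy:", accuracy(pred, labels))
```

One consequence worth noting: with W and b initialized to zeros, every class gets probability 1/10, so the cost before any training step is ln(10), roughly 2.30.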


Neural Networks 




Multilayer Perceptron 
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("/tmp/data/", one_hot=True)

import tensorflow as tf


Extracting /tmp/data/train-images-idx3-ubyte.gz 
Extracting /tmp/data/train-labels-idx1-ubyte.gz 
Extracting /tmp/data/t10k-images-idx3-ubyte.gz 
Extracting /tmp/data/t10k-labels-idx1-ubyte.gz 


# Parameters 
learning_rate = 0.001 
training_epochs = 15 
batch_size = 100 
display_step = 5 


# Network Parameters 

n_hidden_1 = 256 # 1st layer number of features 
n_hidden_2 = 256 # 2nd layer number of features 

n_input = 784  # MNIST data input (img shape: 28*28)
n_classes = 10  # MNIST total classes (0-9 digits)

# tf Graph input
x = tf.placeholder("float", [None, n_input])
y = tf.placeholder("float", [None, n_classes])

# Create model
def multilayer_perceptron(x, weights, biases):
    # Hidden layer with RELU activation
    layer_1 = tf.add(tf.matmul(x, weights['h1']), biases['b1'])
    layer_1 = tf.nn.relu(layer_1)
    # Hidden layer with RELU activation
    layer_2 = tf.add(tf.matmul(layer_1, weights['h2']), biases['b2'])
    layer_2 = tf.nn.relu(layer_2)
    # Output layer with linear activation
    out_layer = tf.matmul(layer_2, weights['out']) + biases['out']
    return out_layer

# Layer weights and biases
weights = {
    'h1': tf.Variable(tf.random_normal([n_input, n_hidden_1])),
    'h2': tf.Variable(tf.random_normal([n_hidden_1, n_hidden_2])),
    'out': tf.Variable(tf.random_normal([n_hidden_2, n_classes]))
}
biases = {
    'b1': tf.Variable(tf.random_normal([n_hidden_1])),
    'b2': tf.Variable(tf.random_normal([n_hidden_2])),
    'out': tf.Variable(tf.random_normal([n_classes]))
}

pred = multilayer_perceptron(x, weights, biases)

cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(pred, y))
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(cost)

init = tf.initialize_all_variables()

# Launch the graph
with tf.Session() as sess:
    sess.run(init)

    # Training cycle
    for epoch in range(training_epochs):
        avg_cost = 0.
        total_batch = int(mnist.train.num_examples/batch_size)
        # Loop over all batches
        for i in range(total_batch):
            batch_x, batch_y = mnist.train.next_batch(batch_size)
            # Run optimization op (backprop) and cost op (to get loss value)
            _, c = sess.run([optimizer, cost], feed_dict={x: batch_x,
                                                          y: batch_y})
            # Compute average loss
            avg_cost += c / total_batch
        # Display logs per epoch step
        if epoch % display_step == 0:
            print "Epoch:", '%04d' % (epoch+1), "cost=", \
                "{:.9f}".format(avg_cost)
    print "Optimization Finished!"

    # Test model
    correct_prediction = tf.equal(tf.argmax(pred, 1), tf.argmax(y, 1))
    # Calculate accuracy
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
    print "Accuracy:", accuracy.eval({x: mnist.test.images, y: mnist.test.labels})








Epoch: 0001 cost= 166.660608705 
Epoch: 0006 cost= 9.265776215 
Epoch: 0011 cost= 2.211729868 
Optimization Finished! 
Accuracy: 0.9434 


Convolutional Neural Network 
import tensorflow as tf

from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("/tmp/data/", one_hot=True)

Extracting /tmp/data/train-images-idx3-ubyte.gz
Extracting /tmp/data/train-labels-idx1-ubyte.gz
Extracting /tmp/data/t10k-images-idx3-ubyte.gz
Extracting /tmp/data/t10k-labels-idx1-ubyte.gz


# Parameters 
learning_rate = 0.001 
training_iters = 200000 
batch_size = 128 
display_step = 200 


# Network Parameters 

n_input = 784  # MNIST data input (img shape: 28*28)
n_classes = 10  # MNIST total classes (0-9 digits)
dropout = 0.75  # Dropout, probability to keep units

# tf Graph input
x = tf.placeholder(tf.float32, [None, n_input])
y = tf.placeholder(tf.float32, [None, n_classes])
keep_prob = tf.placeholder(tf.float32)  # dropout (keep probability)

# Create some wrappers for simplicity
def conv2d(x, W, b, strides=1):
    # Conv2D wrapper, with bias and relu activation
    x = tf.nn.conv2d(x, W, strides=[1, strides, strides, 1], padding='SAME')
    x = tf.nn.bias_add(x, b)
    return tf.nn.relu(x)

def maxpool2d(x, k=2):
    # MaxPool2D wrapper
    return tf.nn.max_pool(x, ksize=[1, k, k, 1], strides=[1, k, k, 1],
                          padding='SAME')

# Create model
def conv_net(x, weights, biases, dropout):
    # Reshape input picture
    x = tf.reshape(x, shape=[-1, 28, 28, 1])

    # Convolution Layer
    conv1 = conv2d(x, weights['wc1'], biases['bc1'])
    # Max Pooling (down-sampling)
    conv1 = maxpool2d(conv1, k=2)

    # Convolution Layer
    conv2 = conv2d(conv1, weights['wc2'], biases['bc2'])
    # Max Pooling (down-sampling)
    conv2 = maxpool2d(conv2, k=2)

    # Fully connected layer
    # Reshape conv2 output to fit fully connected layer input
    fc1 = tf.reshape(conv2, [-1, weights['wd1'].get_shape().as_list()[0]])
    fc1 = tf.add(tf.matmul(fc1, weights['wd1']), biases['bd1'])
    fc1 = tf.nn.relu(fc1)
    # Apply Dropout
    fc1 = tf.nn.dropout(fc1, dropout)

    # Output, class prediction
    out = tf.add(tf.matmul(fc1, weights['out']), biases['out'])
    return out

weights = {
    'wc1': tf.Variable(tf.random_normal([5, 5, 1, 32])),
    'wc2': tf.Variable(tf.random_normal([5, 5, 32, 64])),
    'wd1': tf.Variable(tf.random_normal([7*7*64, 1024])),
    'out': tf.Variable(tf.random_normal([1024, n_classes]))
}
biases = {
    'bc1': tf.Variable(tf.random_normal([32])),
    'bc2': tf.Variable(tf.random_normal([64])),
    'bd1': tf.Variable(tf.random_normal([1024])),
    'out': tf.Variable(tf.random_normal([n_classes]))
}

pred = conv_net(x, weights, biases, keep_prob)

cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(pred, y))
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(cost)

correct_pred = tf.equal(tf.argmax(pred, 1), tf.argmax(y, 1))
accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))

init = tf.initialize_all_variables()

# Launch the graph
with tf.Session() as sess:
    sess.run(init)
    step = 1
    # Keep training until reach max iterations
    while step * batch_size < training_iters:
        batch_x, batch_y = mnist.train.next_batch(batch_size)
        # Run optimization op (backprop)
        sess.run(optimizer, feed_dict={x: batch_x, y: batch_y,
                                       keep_prob: dropout})
        if step % display_step == 0:
            # Calculate batch loss and accuracy
            loss, acc = sess.run([cost, accuracy], feed_dict={x: batch_x,
                                                              y: batch_y,
                                                              keep_prob: 1.})
            print "Iter " + str(step*batch_size) + ", Minibatch Loss= " + \
                  "{:.6f}".format(loss) + ", Training Accuracy= " + \
                  "{:.5f}".format(acc)
        step += 1
    print "Optimization Finished!"

    # Calculate accuracy for 256 mnist test images
    print "Testing Accuracy:", \
        sess.run(accuracy, feed_dict={x: mnist.test.images[:256],
                                      y: mnist.test.labels[:256],
                                      keep_prob: 1.})














Iter 25600, Minibatch Loss= 1453.969238, Training Accuracy= 0.87500
Iter 51200, Minibatch Loss= 0.000000, Training Accuracy= 1.00000
Iter 76800, Minibatch Loss= 836.579651, Training Accuracy= 0.91406
Iter 102400, Minibatch Loss= 265.563293, Training Accuracy= 0.96875
Iter 128000, Minibatch Loss= 120.997910, Training Accuracy= 0.99219
Iter 153600, Minibatch Loss= 29.434311, Training Accuracy= 0.97656
Iter 179200, Minibatch Loss= 248.191101, Training Accuracy= 0.98438
Optimization Finished!
Testing Accuracy: 0.984375
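The first dimension of the 'wd1' weight, 7*7*64, follows from the layer shapes of the network above. A short sketch of that arithmetic, assuming the usual rule that with padding='SAME' the spatial output size is ceil(input / stride); the helper name same_padding_output is made up for this illustration.

```python
import math

def same_padding_output(size, stride):
    # With padding='SAME', the output length is ceil(input / stride).
    return int(math.ceil(size / float(stride)))

size = 28  # MNIST images are 28x28
for layer, stride in [("conv1", 1), ("pool1", 2), ("conv2", 1), ("pool2", 2)]:
    size = same_padding_output(size, stride)
    print(layer, "->", size, "x", size)

# conv2 has 64 output channels, so the flattened feature vector has this length.
flat_features = size * size * 64
print("fully connected input size:", flat_features)
```

The convolutions keep the 28x28 size because their stride is 1, and each 2x2 max-pooling halves it, giving 28, 14 and finally 7.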


Recurrent Neural Network LSTM 




import tensorflow as tf 
from tensorflow.python.ops import rnn, rnn_cell 
import numpy as np 


from tensorflow.examples.tutorials.mnist import input_data 
mnist = input_data.read_data_sets("/tmp/data/", one_hot=True)


Extracting /tmp/data/train-images-idx3-ubyte.gz
Extracting /tmp/data/train-labels-idx1-ubyte.gz
Extracting /tmp/data/t10k-images-idx3-ubyte.gz
Extracting /tmp/data/t10k-labels-idx1-ubyte.gz


# Parameters 
learning_rate = 0.001 
training_iters = 100000 
batch_size = 128 
display_step = 100 


# Network Parameters 

n_input = 28 # MNIST data input (img shape: 28*28) 
n_steps = 28 # timesteps 

n_hidden = 128 # hidden layer num of features 
n_classes = 10 # MNIST total classes (0-9 digits) 


# tf Graph input 
x = tf.placeholder("float", [None, n_steps, n_input])
y = tf.placeholder("float", [None, n_classes])

# Define weights
weights = {
    'out': tf.Variable(tf.random_normal([n_hidden, n_classes]))
}
biases = {
    'out': tf.Variable(tf.random_normal([n_classes]))
}

def RNN(x, weights, biases):
    # Prepare data shape to match `rnn` function requirements
    # Current data input shape: (batch_size, n_steps, n_input)
    # Required shape: 'n_steps' tensors list of shape (batch_size, n_input)

    # Permuting batch_size and n_steps
    x = tf.transpose(x, [1, 0, 2])
    # Reshaping to (n_steps*batch_size, n_input)
    x = tf.reshape(x, [-1, n_input])
    # Split to get a list of 'n_steps' tensors of shape (batch_size, n_input)
    x = tf.split(0, n_steps, x)

    # Define an LSTM cell
    lstm_cell = rnn_cell.BasicLSTMCell(n_hidden, forget_bias=1.0,
                                       state_is_tuple=True)

    # Run the cell over the sequence and classify from the last output
    outputs, states = rnn.rnn(lstm_cell, x, dtype=tf.float32)
    return tf.matmul(outputs[-1], weights['out']) + biases['out']

pred = RNN(x, weights, biases)

cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(pred, y))
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(cost)

correct_pred = tf.equal(tf.argmax(pred, 1), tf.argmax(y, 1))
accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))

init = tf.initialize_all_variables()

# Launch the graph
with tf.Session() as sess:
    sess.run(init)
    step = 1
    # Keep training until reach max iterations
    while step * batch_size < training_iters:
        batch_x, batch_y = mnist.train.next_batch(batch_size)
        # Reshape data to get 28 seq of 28 elements
        batch_x = batch_x.reshape((batch_size, n_steps, n_input))
        # Run optimization op (backprop)
        sess.run(optimizer, feed_dict={x: batch_x, y: batch_y})
        if step % display_step == 0:
            # Calculate batch accuracy
            acc = sess.run(accuracy, feed_dict={x: batch_x, y: batch_y})
            # Calculate batch loss
            loss = sess.run(cost, feed_dict={x: batch_x, y: batch_y})
            print "Iter " + str(step*batch_size) + ", Minibatch Loss= " + \
                  "{:.6f}".format(loss) + ", Training Accuracy= " + \
                  "{:.5f}".format(acc)
        step += 1
    print "Optimization Finished!"

    # Calculate accuracy for 128 mnist test images
    test_len = 128
    test_data = mnist.test.images[:test_len].reshape((-1, n_steps, n_input))
    test_label = mnist.test.labels[:test_len]
    print "Testing Accuracy:", \
        sess.run(accuracy, feed_dict={x: test_data, y: test_label})








[Training log: minibatch loss and training accuracy printed at Iter 12800, 25600, 38400, 51200, 64000, 76800 and 89600]
Optimization Finished!
Testing Accuracy: 0.992188
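The transpose, reshape and split steps inside RNN() are easier to follow on concrete shapes. The following NumPy sketch mirrors those three steps on a small made-up batch (batch_size = 4 is chosen only for illustration) and shows that element t of the resulting list holds time step t for every example in the batch.

```python
import numpy as np

batch_size, n_steps, n_input = 4, 28, 28   # small batch chosen for illustration
x = np.arange(batch_size * n_steps * n_input, dtype=np.float32)
x = x.reshape(batch_size, n_steps, n_input)

# Permute batch and time axes: (n_steps, batch_size, n_input)
x_t = np.transpose(x, (1, 0, 2))
# Collapse the first two axes: (n_steps * batch_size, n_input)
x_flat = x_t.reshape(-1, n_input)
# Cut into n_steps blocks, each of shape (batch_size, n_input)
steps = np.split(x_flat, n_steps, axis=0)

print(len(steps), steps[0].shape)
# Block t contains row t of every image in the batch.
print(np.array_equal(steps[3], x[:, 3, :]))
```

Each MNIST image is thus read as a sequence of 28 rows, one row of 28 pixels per time step.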


Text Classification with a Convolutional Neural Network


from os import path 
import os 

import re 

import codecs 
import pandas as pd 
import numpy as np 


from cPickle import dump, load 


df = load(open('data/tmdf.pickle','rb')) 


df.head() 


   label  txt                                                                          seg_word
0  0      本报记者陈雪频 实习记者唐翔 发自上海\r\n 一家刚刚成立两年的网络支付公司，它的目标是...    本报记者 陈雪频 实习 记者 唐翔 发自 上海 \r\n 一家 刚刚 成立...
1  0      证券通：百联股份未来5年有能力保持高速增长\r\n\r\n深度报告 权威内...               证券 通 ： 百联 股份 未来 5 年 有 能力 保持 高速 增长...
2  0      5月09日消息快评\r\n\r\n深度报告 权威内参 来自“证券通”www....                   5 月 09 日 消息 快评 \r\n \r\n 深度 报告...
3  0      5月09日消息快评\r\n\r\n深度报告 权威内参 来自“证券通”www....                   5 月 09 日 消息 快评 \r\n \r\n 深度 报告...
4  0      5月09日消息快评\r\n\r\n深度报告 权威内参 来自“证券通”www....                   5 月 09 日 消息 快评 \r\n \r\n 深度 报告...


textraw = df.seg_word.values.tolist()


textraw = [line.encode('utf-8') for line in textraw]


maxfeatures = 20000  # vocabulary size kept by the tokenizer


from keras.preprocessing.text import Tokenizer


token = Tokenizer(nb_words=maxfeatures)
token.fit_on_texts(textraw)
text_seq = token.texts_to_sequences(textraw)


np.median([len(x) for x in text_seq])


498.0


y = df.label.values
nb_classes = len(np.unique(y))
print(nb_classes)


from __future__ import absolute_import
from keras.optimizers import RMSprop
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation, Flatten
from keras.layers.embeddings import Embedding
from keras.layers.convolutional import Convolution1D, MaxPooling1D
from keras.layers.recurrent import SimpleRNN, GRU, LSTM
from keras.callbacks import EarlyStopping


maxlen = 600  # maximum text length
batch_size = 32  # batch size
word_dim = 100  # word vector dimension
nb_filter = 200  # number of convolution kernels
filter_length = 10  # convolution window size
hidden_dims = 50  # number of hidden-layer neurons
nb_epoch = 10  # number of training epochs
pool_length = 50  # pooling window size


from sklearn.cross_validation import train_test_split
train_X, test_X, train_y, test_y = train_test_split(text_seq, y,
                                                    train_size=0.8, random_state=1)


# Convert to equal-length matrices of length maxlen
print("Pad sequences (samples x time)")
X_train = sequence.pad_sequences(train_X, maxlen=maxlen, padding='post', truncating='post')
X_test = sequence.pad_sequences(test_X, maxlen=maxlen, padding='post', truncating='post')
print('X train shape:', X_train.shape)
print('X test shape:', X_test.shape)


Pad sequences (samples x time) 
('X train shape:', (14328, 600)) 
('X test shape:', (3582, 600)) 
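The effect of padding='post' and truncating='post' can be illustrated without Keras. The helper pad_post below is a hypothetical pure-Python stand-in written for this note, not the Keras function itself; it assumes the padding value is 0 and that both truncation and padding act on the tail of each sequence.

```python
def pad_post(seqs, maxlen, value=0):
    padded = []
    for seq in seqs:
        seq = list(seq)[:maxlen]                    # truncating='post': drop the tail
        seq = seq + [value] * (maxlen - len(seq))   # padding='post': fill the tail
        padded.append(seq)
    return padded

toy = [[5, 8, 2], [7], [1, 2, 3, 4, 5, 6]]   # made-up word-index sequences
for row in pad_post(toy, maxlen=4):
    print(row)
```

With maxlen = 600 and a median length of 498 words, most texts in the example above are padded at the end, and the longer ones lose everything after the 600th word.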


from keras.utils import np_utils
Y_train = np_utils.to_categorical(train_y, nb_classes)
Y_test = np_utils.to_categorical(test_y, nb_classes)


import tensorflow as tf
tf.python.control_flow_ops = tf


print('Build model...')
model = Sequential()
model.add(Embedding(maxfeatures, word_dim, input_length=maxlen, dropout=0.25))
model.add(Convolution1D(nb_filter=nb_filter,
                        filter_length=filter_length,
                        border_mode="valid",
                        activation="relu"))
model.add(MaxPooling1D(pool_length=pool_length))
model.add(Flatten())
model.add(Dense(hidden_dims))
model.add(Dropout(0.25))
model.add(Activation('relu'))
model.add(Dense(nb_classes))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=["accuracy"])


Build model... 
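The sizes flowing through the convolution, pooling and flatten layers follow directly from the hyperparameters. A small sketch of that arithmetic, assuming a stride-1 'valid' convolution and a pooling stride equal to pool_length (non-overlapping windows, which is the behavior when no stride is given):

```python
maxlen, filter_length, nb_filter, pool_length = 600, 10, 200, 50

# 'valid' convolution with stride 1: one output per full window position
conv_steps = maxlen - filter_length + 1
# Non-overlapping pooling windows; a trailing partial window is dropped
pool_steps = (conv_steps - pool_length) // pool_length + 1
# Flatten() concatenates all pooled steps across all filters
flat_features = pool_steps * nb_filter

print(conv_steps, pool_steps, flat_features)
```

So each 600-word text becomes 591 convolution positions, 11 pooled positions, and a flat vector of 2200 features feeding the 50-unit dense layer.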


earlystop = EarlyStopping(monitor='val_loss', patience=1, verbose=1)
result = model.fit(X_train, Y_train, batch_size=batch_size, nb_epoch=nb_epoch,
                   validation_split=0.1, callbacks=[earlystop])


/Users/xiaokai/anaconda/envs/tensorflow/lib/python2.7/site-packages/keras/models.py:603: UserWarning: The "show_accuracy" argument is deprecated, instead you should pass the "accuracy" metric to the model at compile time:
`model.compile(optimizer, loss, metrics=["accuracy"])`
  warnings.warn('The "show_accuracy" argument is deprecated, '
/Users/xiaokai/anaconda/envs/tensorflow/lib/python2.7/site-packages/tensorflow/python/ops/gradients.py:90: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Train on 12895 samples, validate on 1433 samples
Epoch 1/10
12895/12895 [==============================] - 704s - loss: 1.4306 - val_loss: 0.5532
Epoch 2/10
12895/12895 [==============================] - 7724s - loss: 0.4912 - val_loss: 0.4273
Epoch 3/10
12895/12895 [==============================] - 765s - loss: 0.3511 - val_loss: 0.4003
Epoch 4/10
12895/12895 [==============================] - 807s - loss: 0.2571 - val_loss: 0.4114
Epoch 5/10
12864/12895 [============================>.] - ETA: 5s - loss: 0.1971Epoch 00004: early stopping
12895/12895 [==============================] - 2285s - loss: 0.1968 - val_loss: 0.4415
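The stopping point in the log can be traced with a few lines of plain Python. This is a sketch of one possible patience rule, not the Keras source: it assumes the callback first checks whether the wait counter has reached patience and only then increments it, which is the variant that reproduces "Epoch 00004" for the val_loss values printed above. Other Keras versions may count differently. The function name early_stop_epoch is made up for this illustration.

```python
def early_stop_epoch(val_losses, patience):
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):   # epoch is 0-based
        if loss < best:
            best = loss      # improvement: remember it and reset the counter
            wait = 0
        else:
            if wait >= patience:
                return epoch  # no improvement for more than `patience` epochs
            wait += 1
    return None               # training ran to completion

# val_loss values taken from the training log above
val_losses = [0.5532, 0.4273, 0.4003, 0.4114, 0.4415]
print(early_stop_epoch(val_losses, patience=1))
```

The validation loss bottoms out at 0.4003 in the third epoch; the two following epochs are worse, so training stops even though the training loss is still falling, which is the usual sign of overfitting.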


score = earlystop.model.evaluate(X_test, Y_test, batch_size=batch_size)
print('Test score:', score)
classes = earlystop.model.predict_classes(X_test, batch_size=batch_size)
acc = np_utils.accuracy(classes, test_y)
print('Test accuracy:', acc)


3582/3582 [==============================] - 73s
('Test score:', 0.4292584941548252)
3582/3582 [==============================]
('Test accuracy:', 0.89056393076493578)


